Building fast refresh capability in Amazon RDS for PostgreSQL

This post is a guest post by CDL. In their own words, CDL is a “leading insurtech company based in the UK, featured in the Financial Times Future 100 listing of high-growth UK companies making an impact on their industry. It has a strong track record in the insurance and financial services sectors, with its solutions powering some of the UK’s most profitable insurance retailers. Transacting over 7 million policies on its systems, CDL’s clients include Sainsbury’s Bank, Tesco Bank, Swinton Insurance, and Moneysupermarket.com.”

This post describes how CDL developed fast refresh capability using materialized view logs in Amazon RDS for PostgreSQL. We detail what we built to track changes and offer an alternative to the complete refresh, which slashed the time needed from hours to seconds. We also share open-source software to enable fast refresh with the wider PostgreSQL community and outline the installation process involved.

The challenge

CDL handles millions of transactions each day. We were looking to migrate our business intelligence (BI) platform from Oracle to PostgreSQL. To make the switch viable, we needed the relational database to handle this volume of change and refresh in near-real time so that customers retain access to up-to-the-minute business intelligence.

PostgreSQL only has the capability of a complete refresh. However, our service-level agreements require data to be refreshed every 15 minutes, and the volume of change transacted by CDL meant that the complete refresh process couldn’t handle the size of our materialized views (MVs) within this timescale. Our largest MV log is over 150 GB, which took hours to build through the complete refresh process, and there can be over 150 views per day, depending on the customer.

We use the MVs on our BI solution to essentially de-normalize the data from our online transaction processing (OLTP) layer into logical data marts for our clients to consume and use to perform data analytics. Fast refresh capability was therefore an essential prerequisite for CDL when we switched from Oracle to PostgreSQL.

Fast refresh vs. complete refresh

Fast refresh uses materialized view logs on the underlying tables to keep track of changes, and only the changes since the last refresh are applied to the MV. In contrast, the complete refresh process refreshes all the data and could inevitably take hours for large datasets.

The following diagram illustrates how we use MVs in our solution: we sync our data from the OLTP layer into an RDS instance using Attunity Replicate. We then build the MVs in the RDS instance from the OLTP source tables and refresh the jobs every 15 minutes using refresh schedules.

Developing the process for PostgreSQL

The fast refresh process was designed to be installed into its own schema that contains the functions needed to run the MV process, with three data dictionary tables and one role.

The workflow for the MV log creation in the following diagram is as follows:

Add the UUID column on the source table
Create the MV log for the source table
Add indexes on MV log table
Create the trigger on the source table
Insert the entry into the pgmview_logs table

The workflow for the MV creation shown in the following diagram is as follows:

Deconstruct the SQL statement
Extract the compound MV tables
Create the base table from MV SQL
Add the rowid to the source table
Add index to rowid column
Insert into the pgmviews table
Insert in the pgmviews_details table
Complete refresh on the MV

There are four main functions a user runs:

mv$createMaterializedViewlog – Creates the MV logs
mv$createMaterializedView – Creates the MV
mv$removeMaterializedViewLog – Removes the MV logs
mv$removeMaterializedView – Removes the MV

Source code

Download the source code from the GitHub repo. See the following code:

git clone https://github.com/CDLSoftware/pg-mv-fast-refresh.git

Installing the fast refresh functions

The install of the fast refresh functions is designed to live in its own schema in the database that is specified via the MODULEOWNER parameter.

To install the MV code, complete the following steps:

Navigate to the folder where you downloaded the repo and edit the module_set_variable.sh file.
This is where all the variables are stored for where you want to install the fast refresh functions.
You don’t need the SOURCEUSERNAME/SOURCEPASSWORD and MVUSERNAME/MVPASSWORD parameters to install the fast refresh functions; you use them for the test harness setup. See the following code:

cd pg-mv-fast-refresh
vi module_set_variables.sh

MODULEOWNER=<MODULEOWNER> - The module owner username
MODULE_HOME=<MODULE_HOME> - The Module home path 
MODULEOWNERPASS=<MODULEOWNERPASS> - Password for module owner PGRS_MVIEW
HOSTNAME=<HOSTNAME> - Hostname for database
PORT=<PORT>	 - port for database
DBNAME=<DBNAME>	 - Database Name
PGUSERNAME=<PGUSERNAME> - DB username for the module installation run
PGPASSWORD=<PGPASSWORD> - DB username password for the module installation run
SOURCEUSERNAME=<SOURCEUSERNAME> - DB username for the source tables for the MV
SOURCEPASSWORD=<SOURCEPASSWORD> - DB password for the source tables user
MVUSERNAME=<MVUSERNAME> - DB username for the MV owner
MVPASSWORD=<MVPASSWORD> - DB password for the MV owner
LOG_FILE=<LOG_PATH> - Path to logfile output location

As an example of the parameter settings used in the test case, you have an RDS instance pg-tonytest.citnv5igau2v.eu-west-1.rds.amazonaws.com with a database testpoc and a master username dbamin.
The install package installs the fast refresh functions under the schema testpoc.
You have a source schema testpocsource. This is where the source data tables go for the test harness and a testpocmv, which is the schema where the MV is built. See the following code:

ort MODULEOWNER=testpoc
export MODULE_HOME=/var/lib/pgsql/pg-mv-fast-refresh
export MODULEOWNERPASS=testpoc123
export HOSTNAME=pg-tonytest.citnv5igau2v.eu-west-1.rds.amazonaws.com
export PORT=5432
export DBNAME=testpoc
export PGUSERNAME=dbadmin
export PGPASSWORD=testpoc123
export SOURCEUSERNAME=testpocsource
export SOURCEPASSWORD=testpoc123
export MVUSERNAME=testpocmv
export MVPASSWORD=testpoc123
export LOG_FILE=/tmp/fast_refresh_module_install_`date +%Y%m%d-%H%M`.log

Change the permissions on the script runCreateFastRefreshModule.sh to execute and then run. See the following code:
```
chmod 700 runCreateFastRefreshModule.sh
./runCreateFastRefreshModule.sh
```
This should just take seconds to run.

When it’s complete, check the log file in the location you set.
For this post, the log file is in /tmp. The status is at the bottom. See the following example code:

-bash-4.1$ cat fast_refresh_module_install_20190913-1358.log
INFO: Set variables
INFO: LOG_FILE parameter set to /tmp/fast_refresh_module_install_20190913-1358.log
INFO: MODULEOWNER parameter set to testpoc
INFO: PGUSERNAME parameter set to dbadmin
INFO: HOSTNAME parameter set to pg-tonytest.citnv5igau2v.eu-west-1.rds.amazonaws.com
INFO: PORT parameter set to 5432
INFO: DBNAME parameter set to strata
INFO: MODULE_HOME parameter set to /var/lib/pgsql/pg-mv-fast-refresh
INFO: Run testpoc schema build script
INFO: Connect to postgres database strata via PSQL session
…….. cut lines………..
GRANT
INFO: Running Module Deployment Error Checks
INFO: All Objects compiled successfully
INFO: No Errors Found
INFO: Completed Module Deployment Error Checks

Seeing the functions in the database

If you connect to the database on which you installed the package, you should see the functions under the MODULEOWNER. For this post, the functions are under testpoc.

Installing the test harness

The following use case demonstrates how to create a module view (MV). You create six tables and insert some data into the tables. All the following steps are in the create_test_harness.sh script and use the variables set in module_set_variables.sh.

Create a schema for the test data. See the following code:

psql --host=$HOSTNAME --port=$PORT --username=$PGUSERNAME --dbname=$DBNAME

CREATE USER $SOURCEUSERNAME WITH
LOGIN
NOSUPERUSER
NOCREATEDB
NOCREATEROLE
INHERIT
NOREPLICATION
CONNECTION LIMIT -1
PASSWORD '$SOURCEPASSWORD';

GRANT ALL PRIVILEGES ON DATABASE "$DBNAME" to $SOURCEUSERNAME;
GRANT $SOURCEUSERNAME to $PGUSERNAME;
CREATE SCHEMA $SOURCEUSERNAME AUTHORIZATION $SOURCEUSERNAME;
GRANT ALL PRIVILEGES ON SCHEMA $SOURCEUSERNAME to $PGUSERNAME;
GRANT $SOURCEUSERNAME to $MODULEOWNER;
GRANT USAGE ON SCHEMA $SOURCEUSERNAME TO $MODULEOWNER;
GRANT ALL ON SCHEMA $MODULEOWNER TO $SOURCEUSERNAME;

Create a schema for the MV. See the following code:

psql --host=$HOSTNAME --port=$PORT --username=$PGUSERNAME --dbname=$DBNAME

CREATE USER $MVUSERNAME WITH
LOGIN
NOSUPERUSER
NOCREATEDB
NOCREATEROLE
INHERIT
NOREPLICATION
CONNECTION LIMIT -1
PASSWORD '$MVPASSWORD';

GRANT $MVUSERNAME to $PGUSERNAME;
CREATE SCHEMA IF NOT EXISTS $MVUSERNAME AUTHORIZATION $MVUSERNAME;
GRANT ALL ON SCHEMA $MVUSERNAME TO $MVUSERNAME;
GRANT SELECT ON ALL TABLES IN SCHEMA $SOURCEUSERNAME TO $MVUSERNAME;
GRANT pgmv\$_role TO $MVUSERNAME;
ALTER DATABASE $DBNAME SET SEARCH_PATH=public,$MODULEOWNER,$MVUSERNAME,$SOURCEUSERNAME;
GRANT $SOURCEUSERNAME TO $MODULEOWNER;
GRANT USAGE ON SCHEMA $SOURCEUSERNAME TO $MODULEOWNER;
GRANT ALL PRIVILEGES ON DATABASE strata TO $MODULEOWNER;
GRANT ALL ON SCHEMA $MVUSERNAME  TO $MODULEOWNER;
GRANT USAGE ON SCHEMA $MVUSERNAME  TO $MODULEOWNER;
GRANT $MVUSERNAME  TO $MODULEOWNER;
GRANT $MODULEOWNER TO $MVUSERNAME;
GRANT USAGE ON SCHEMA $MODULEOWNER TO $MVUSERNAME;
GRANT ALL ON SCHEMA $SOURCEUSERNAME TO $MODULEOWNER;

Create the test data as the sourceusername. See the following code:

psql --host=$HOSTNAME --port=$PORT --username=$SOURCEUSERNAME --dbname=$DBNAME

-- create t1 table

CREATE TABLE $SOURCEUSERNAME.t1
(
    id numeric NOT NULL,
    lookup_id numeric,
    code character varying(10) COLLATE pg_catalog."default",
    CONSTRAINT t1_pkey PRIMARY KEY (id)
);

-- create t2 table

CREATE TABLE $SOURCEUSERNAME.t2
(
    id numeric NOT NULL,
    description character varying(100) COLLATE pg_catalog."default",
    metavals_id numeric,
    age integer NOT NULL,
    CONSTRAINT t2_pkey PRIMARY KEY (id)
);

-- create t3 table

CREATE TABLE $SOURCEUSERNAME.t3
(
    lookup_id numeric NOT NULL,
    lookup_code character varying(10) COLLATE pg_catalog."default" NOT NULL,
    lookup_description character varying(50) COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT t3_pkey PRIMARY KEY (lookup_id)
);

-- create t4 table

CREATE TABLE $SOURCEUSERNAME.t4
(
    metavals_id numeric NOT NULL,
    code character varying(10) COLLATE pg_catalog."default" NOT NULL,
    description character varying(30) COLLATE pg_catalog."default",
    CONSTRAINT t4_pkey PRIMARY KEY (metavals_id)
);

-- create t5 table

CREATE TABLE $SOURCEUSERNAME.t5
(
    id numeric NOT NULL,
    rep_ind character varying(1) COLLATE pg_catalog."default",
    trans_id numeric,
    CONSTRAINT t5_pkey PRIMARY KEY (id)
);

-- create t6 table

CREATE TABLE $SOURCEUSERNAME.t6
(
    trans_id numeric NOT NULL,
    payment_reference character varying(20) COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT t6_pkey PRIMARY KEY (trans_id)
);

-- insert records into t1 table

INSERT INTO $SOURCEUSERNAME.t1(
	id, lookup_id, code)
	VALUES (1, 10, 'hello');

INSERT INTO $SOURCEUSERNAME.t1(
	id, lookup_id, code)
	VALUES (2, 20, 'bye');

INSERT INTO $SOURCEUSERNAME.t1(
	id, lookup_id, code)
	VALUES (3, 30, 'cya');

INSERT INTO $SOURCEUSERNAME.t1(
	id, lookup_id, code)
	VALUES (4, 50, 'goodbye');

INSERT INTO $SOURCEUSERNAME.t1(
	id, lookup_id, code)
	VALUES (5, 50, 'hi');

INSERT INTO $SOURCEUSERNAME.t1(
	id, lookup_id, code)
	VALUES (6, 20, 'bye');

-- insert records into t2 table

INSERT INTO $SOURCEUSERNAME.t2(
	id, description, metavals_id, age)
	VALUES (1, 'house', 100, 20);

INSERT INTO $SOURCEUSERNAME.t2(
	id, description, metavals_id, age)
	VALUES (2, 'flat', 200, 35);

INSERT INTO $SOURCEUSERNAME.t2(
	id, description, metavals_id, age)
	VALUES (3, 'bungalow', 300, 30);

INSERT INTO $SOURCEUSERNAME.t2(
	id, description, metavals_id, age)
	VALUES (4, 'palace', 300, 30);

INSERT INTO $SOURCEUSERNAME.t2(
	id, description, metavals_id, age)
	VALUES (5, 'office', 400, 50);

-- insert records into t3 table

INSERT INTO $SOURCEUSERNAME.t3(
	lookup_id, lookup_code, lookup_description)
	VALUES (10, 'ENG', 'ENGLAND');

INSERT INTO $SOURCEUSERNAME.t3(
	lookup_id, lookup_code, lookup_description)
	VALUES (20, 'WAL', 'WALES');

INSERT INTO $SOURCEUSERNAME.t3(
	lookup_id, lookup_code, lookup_description)
	VALUES (30, 'SCO', 'SCOTLAND');

INSERT INTO $SOURCEUSERNAME.t3(
	lookup_id, lookup_code, lookup_description)
	VALUES (40, 'IRE', 'IRELAND');

INSERT INTO $SOURCEUSERNAME.t3(
	lookup_id, lookup_code, lookup_description)
	VALUES (50, 'FRA', 'FRANCE');

-- insert records into t4 table

INSERT INTO $SOURCEUSERNAME.t4(
	metavals_id, code, description)
	VALUES (100,'CHAIR','SMALL CHAIR');

INSERT INTO $SOURCEUSERNAME.t4(
	metavals_id, code, description)
	VALUES (200,'TABLE','SMALL TABLE');

INSERT INTO $SOURCEUSERNAME.t4(
	metavals_id, code, description)
	VALUES (300,'LIGHT','BRIGHT LIGHT');

INSERT INTO $SOURCEUSERNAME.t4(
	metavals_id, code, description)
	VALUES (400,'BED','KING SIZE BED');

INSERT INTO $SOURCEUSERNAME.t4(
	metavals_id, code, description)
	VALUES (500,'CUPBOARD','BEDSIDE CUPBOARD');

-- insert records into t5 table

INSERT INTO $SOURCEUSERNAME.t5(
	id, rep_ind, trans_id)
	VALUES (1, 'Y', 1000);

INSERT INTO $SOURCEUSERNAME.t5(
	id, rep_ind, trans_id)
	VALUES (2, 'Y', 2000);

INSERT INTO $SOURCEUSERNAME.t5(
	id, rep_ind, trans_id)
	VALUES (3, 'N', 3000);

INSERT INTO $SOURCEUSERNAME.t5(
	id, rep_ind, trans_id)
	VALUES (4, 'Y', 4000);

INSERT INTO $SOURCEUSERNAME.t5(
	id, rep_ind, trans_id)
	VALUES (5, 'N', 5000);

-- insert records into t6 table

INSERT INTO $SOURCEUSERNAME.t6(
	trans_id, payment_reference)
	VALUES (1000, 'GZ-1000');

INSERT INTO $SOURCEUSERNAME.t6(
	trans_id, payment_reference)
	VALUES (2000, 'AZ-2000');

INSERT INTO $SOURCEUSERNAME.t6(
	trans_id, payment_reference)
	VALUES (3000, 'BZ-3000');

INSERT INTO $SOURCEUSERNAME.t6(
	trans_id, payment_reference)
	VALUES (4000, 'QZ-4000');

INSERT INTO $SOURCEUSERNAME.t6(
	trans_id, payment_reference)
	VALUES (5000, 'VZ-5000');

Create the MV logs. See the following code:

psql --host=$HOSTNAME --port=$PORT --username=$SOURCEUSERNAME --dbname=$DBNAME

DO
\$do\$
DECLARE
    cResult CHAR(1) := NULL;
BEGIN
    cResult := $MODULEOWNER.mv\$createMaterializedViewlog( 't1','$SOURCEUSERNAME');
    cResult := $MODULEOWNER.mv\$createMaterializedViewlog( 't2','$SOURCEUSERNAME');
    cResult := $MODULEOWNER.mv\$createMaterializedViewlog( 't3','$SOURCEUSERNAME');
    cResult := $MODULEOWNER.mv\$createMaterializedViewlog( 't4','$SOURCEUSERNAME');
    cResult := $MODULEOWNER.mv\$createMaterializedViewlog( 't5','$SOURCEUSERNAME');
    cResult := $MODULEOWNER.mv\$createMaterializedViewlog( 't6','$SOURCEUSERNAME');

END
\$do\$;

Create the MV. See the following code:

psql --host=$HOSTNAME --port=$PORT --username=$MVUSERNAME --dbname=$DBNAME

DO
\$do\$
DECLARE
    tStartTime      TIMESTAMP   := clock_timestamp();
    cResult         CHAR(1)     := NULL;
    pSqlStatement   TEXT;
BEGIN

    pSqlStatement := '
SELECT t1.id t1_id,
t1.lookup_id t1_lookup_id,
t1.code t1_code,
t2.id t2_id,
t2.description t2_desc,
t2.metavals_id t2_metavals_id,
t2.age t2_age,
t3.lookup_id t3_lookup_id,
t3.lookup_code t3_lookup_code,
t3.lookup_description t3_lookup_desc,
t4.metavals_id t4_metavals_id,
t4.code t4_code,
t4.description t4_desc,
t5.id t5_id,
t5.rep_ind t5_rep_ind,
t5.trans_id t5_trans_id,
t6.trans_id t6_trans_id,
t6.payment_reference t6_payment_ref
FROM
t1
INNER JOIN t2 ON t1.id = t2.id
LEFT JOIN t3 ON t1.lookup_id = t3.lookup_id
LEFT JOIN t4 ON t2.metavals_id = t4.metavals_id
INNER JOIN t5 ON t1.id = t5.id
LEFT JOIN t6 ON t5.trans_id = t6.trans_id';
    cResult := mv\$createMaterializedView
    (
        pViewName           => 'mv_fast_refresh_funct_test',
        pSelectStatement    =>  pSqlStatement,
        pOwner              => '$MVUSERNAME',
        pFastRefresh        =>  TRUE
    );
    RAISE NOTICE 'Complex Materialized View creation took % % %', clock_timestamp() - tStartTime, chr(10), chr(10);
END
\$do\$;

Select data from the MV. See the following code:

strata=> select t1_id, t1_lookup_id, t1_code, t2_id, t2_desc, t3_lookup_desc from mv_fast_refresh_funct_test order by t1_id;
 t1_id | t1_lookup_id | t1_code | t2_id | t2_desc  | t3_lookup_desc
-------+--------------+---------+-------+----------+----------------
     1 |           10 | hello   |     1 | house    | ENGLAND
     2 |           20 | bye     |     2 | flat     | WALES
     3 |           30 | cya     |     3 | bungalow | SCOTLAND
     4 |           50 | goodbye |     4 | palace   | FRANCE
     5 |           50 | hi      |     5 | office   | FRANCE
(5 rows)

Updating the test harness

This is a very simple test to show how the fast refresh process works. For this post, you update a row in the base table t1 and see how that is reflected in the MV after a fast refresh. This test script is update_mv_test.sh.

Update a row for table t1 with the code UPDATE testpocsource.t1 set code='yo' where code='hello':

INFO: Changing a row for table t1
UPDATE testpocsource.t1 set code='yo' where code='hello'
Press the enter key to do the update...
UPDATE 1

Check that the MV the data isn’t changed (because it has not been refreshed). The following code shows that it is still set to hello:

INFO: Check the output from the MV
Select * from mv_fast_refresh_funct_test order by t1_id;
Press the enter key to see the MV output...
 t1_id | t1_lookup_id | t1_code | t2_id | t2_desc  | t3_lookup_desc
-------+--------------+---------+-------+----------+----------------
     1 |           10 | hello   |     1 | house    | ENGLAND
     2 |           20 | bye     |     2 | flat     | WALES
     3 |           30 | cya     |     3 | bungalow | SCOTLAND
     4 |           50 | goodbye |     4 | palace   | FRANCE
     5 |           50 | hi      |     5 | office   | FRANCE
(5 rows)

Do a fast refresh on the MV and check it again. The following code shows that the row has changed:

INFO: Let’s do a MV fast refresh to sync the MV with the change
Press the enter key to do the MV refresh...
NOTICE:  Fast Snapshot Refresh took 00:00:00.055291


CONTEXT:  PL/pgSQL function inline_code_block line 12 at RAISE
DO
INFO: Now Check the output from the MV the code for t1_id is now yo
Press the enter key to see the MV output...
 t1_id | t1_lookup_id | t1_code | t2_id | t2_desc  | t3_lookup_desc
-------+--------------+---------+-------+----------+----------------
     1 |           10 | yo      |     1 | house    | ENGLAND
     2 |           20 | bye     |     2 | flat     | WALES
     3 |           30 | cya     |     3 | bungalow | SCOTLAND
     4 |           50 | goodbye |     4 | palace   | FRANCE
     5 |           50 | hi      |     5 | office   | FRANCE
(5 rows)

Change the row back to hello by updating the t1 table and refreshing the view again. See the following code:

INFO: Now lets change the code back to hello
UPDATE testpocsource.t1 set code='hello' where code='yo'
Press the enter key to do the update...
UPDATE 1
INFO: Lets do a MV fast refresh to sync the MV with the change
Press the enter key to do the MV refresh...
NOTICE:  Fast Snapshot Refresh took 00:00:00.021894


CONTEXT:  PL/pgSQL function inline_code_block line 12 at RAISE
DO
INFO: Now Check the output from the MV the code for t1_id its back to hello
Press the enter key to see the MV output...
 t1_id | t1_lookup_id | t1_code | t2_id | t2_desc  | t3_lookup_desc
-------+--------------+---------+-------+----------+----------------
     1 |           10 | hello   |     1 | house    | ENGLAND
     2 |           20 | bye     |     2 | flat     | WALES
     3 |           30 | cya     |     3 | bungalow | SCOTLAND
     4 |           50 | goodbye |     4 | palace   | FRANCE
     5 |           50 | hi      |     5 | office   | FRANCE
(5 rows)

Removing the test harness

To remove the test harness, run the drop_test_harness.sh script. This removes the data and MV schemas from the database. See the following code:

-bash-4.2$ ./drop_test_harness.sh
INFO: Drop Complete check logfile for status - /tmp/test_harness_drop_20191217-1300.log

Removing the fast refresh functions

To remove the fast refresh functions from the database, run the dropFastRefreshModule.sh script. This prompts you to enter Y to proceed with the removal of the objects. See the following code:

-bash-4.2$ ./dropFastRefreshModule.sh
Are you sure you want to remove the module schema - testpoc (y/n)?y
yes selected the schemas - testpoc will be dropped

Conclusion

This post reviewed the fast refresh module and provided a simple example of getting an MV up and running. To enable our customers to access real-time business intelligence, we had to find a unique solution. We hope you enjoyed learning about how to develop fast refresh capabilities using MV logs with PostgreSQL. You can download the code from the GitHub repo.

Stay tuned for part two of this series, where we’ll discuss the module and results we have seen in running this solution in the production environment.

About the Author

Tony Mullen is a Principal Data Engineer at CDL. He has over 18 years’ experience in databases; primarily Oracle, with a shift towards PostgreSQL over recent years. He heads up the migration of key applications from Oracle on-prem to Amazon RDS PostgreSQL, and has a keen interest in building the PostgreSQL community in Manchester, co-hosting the PostgreSQL Manchester meet-up.