AWS Database Blog

Impactful features in PostgreSQL 15

PostgreSQL is one of the most popular open-source relational database systems. The product of more than 30 years of development work, PostgreSQL has proven to be a highly reliable and robust database that can handle a large number of complicated data workloads. AWS offers services that make PostgreSQL database deployments straightforward to set up, manage, and scale for the cloud in a cost-efficient manner. These services are Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition.

The PostgreSQL developer community released PostgreSQL 15 about a year ago, which has given us time to study the impactful features of the release. PostgreSQL 15 further enhances the performance gains of recent releases, offering significant improvements for managing workloads in local and distributed environments, particularly in sorting operations. Inclusion of the popular MERGE command enhances the developer experience, while additional capabilities for observing the database’s state provide more insights.

In this post, we take a close look at some of the most impactful features of PostgreSQL 15. We explore the MERGE statement, new WAL compression algorithms, enhancements to logical replication and PUBLIC schema permissions, improvements to partition tables, and some general performance improvements.

Prerequisites

To follow along with this post, complete the following prerequisites:

  1. Create an Aurora PostgreSQL cluster or RDS for PostgreSQL instance if you don’t already have one. For instructions, refer to Create an Amazon Aurora PostgreSQL-Compatible DB cluster or Create a PostgreSQL DB instance, respectively.
  2. Create an Amazon Elastic Compute Cloud (Amazon EC2) instance to install the PostgreSQL client to access the Aurora PostgreSQL or RDS for PostgreSQL instance. For instructions, refer to Create your EC2 resources and launch your EC2 instance. Or you can set up connectivity between your RDS database and EC2 compute instance in one click.
  3. Install the PostgreSQL client. On Amazon Linux 2023, you can use the following command to install the psql command line tool:
sudo dnf install postgresql15-client

MERGE command

MERGE was one of the most highly anticipated features in PostgreSQL 15 because it provides a single command that lets you perform conditional inserts, updates, and deletes. Prior to PostgreSQL 15, you could use the INSERT ... ON CONFLICT statement to run upsert-style queries, but the target row needed to have a unique or exclusion constraint. With MERGE, you now have more options for modifying rows from a single SQL statement, including inserts, updates, and deletes. It simplifies data manipulation and enhances the overall developer experience by reducing complexity and optimizing database interactions. This command is used to merge data from a source table into a target table, allowing you to update or insert data into an existing table based on a match between the source and target tables.

Usage of the MERGE command

The basic syntax for the MERGE command is as follows:

MERGE INTO target_table USING source_table ON join_condition
WHEN MATCHED THEN UPDATE SET column1=value1, column2=value2, ...
WHEN NOT MATCHED THEN INSERT (column1, column2, ...) VALUES (value1, value2, ...)

Let’s look at the explanation of the syntax:

  • MERGE INTO target_table – Specifies the target table that you want to merge data into.
  • USING source_table – Specifies the source table that you want to merge data from.
  • ON join_condition – Specifies the join condition that determines which rows in the source table match which rows in the target table.
  • WHEN MATCHED THEN UPDATE – Specifies the action to take when a match is found between a row in the source table and a row in the target table. In this case, the specified columns in the target table will be updated with their respective specified values from the source table.
  • WHEN NOT MATCHED THEN INSERT – Specifies the action to take when no match is found between a row in the source table and a row in the target table. In this case, a new row will be inserted into the target table with the specified values.

Example usage

Let’s explore an example with a source table called employee_updates and a target table called employees. You can connect to your EC2 instance and, from there, to your Aurora PostgreSQL or RDS for PostgreSQL database.

  1. Create the employees table:
    CREATE TABLE employees (
        id SERIAL PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        salary NUMERIC NOT NULL
    );
  2. Insert some initial data into the employees table:
    INSERT INTO employees (first_name, last_name, salary) VALUES ('John', 'Doe', 50000),
           ('Jane', 'Doe', 60000),
           ('Bob', 'Smith', 70000);
    
  3. Create the employee_updates table:
    CREATE TABLE employee_updates (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        salary NUMERIC NOT NULL
    );
    
  4. Insert some updated data into the employee_updates table:
    INSERT INTO employee_updates (id, first_name, last_name, salary) VALUES (1, 'Johnny', 'Doe', 55000),
           (2, 'Janie', 'Doe', 65000),
           (4, 'Alice', 'Jones', 80000);
    
  5. Use the MERGE command to update the data in the target table:
    MERGE INTO employees
    USING employee_updates ON employees.id = employee_updates.id
    WHEN MATCHED THEN
        UPDATE SET
            first_name = employee_updates.first_name, 
            last_name = employee_updates.last_name, 
            salary = employee_updates.salary
    WHEN NOT MATCHED THEN
        INSERT (id, first_name, last_name, salary)
            VALUES (employee_updates.id, employee_updates.first_name, 
                   employee_updates.last_name, employee_updates.salary);
    

    The MERGE command is used to update and insert data from the employee_updates table into the employees table. It uses the following clauses:

    • ON – Specifies that we want to match rows in the employees table with rows in the employee_updates table based on the id column
    • WHEN MATCHED – Specifies that we want to update the first_name, last_name, and salary columns in the employees table with the corresponding values from the employee_updates table
    • WHEN NOT MATCHED – Specifies that we want to insert new rows into the employees table for any rows in the employee_updates table that don’t have a matching id in the employees table
  6. Finally, verify that the data has been updated and inserted correctly by running a SELECT statement on the employees table:
    postgres=> SELECT * FROM employees;
     id | first_name | last_name | salary
    ----+------------+-----------+--------
      3 | Bob        | Smith     |  70000
      1 | Johnny     | Doe       |  55000
      2 | Janie      | Doe       |  65000
      4 | Alice      | Jones     |  80000
    (4 rows)
    

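MERGE also supports DELETE actions and additional conditions on each WHEN clause. The following is a minimal sketch (the salary threshold is an illustrative assumption) that deletes matched employees below a threshold and updates the rest:

MERGE INTO employees e
USING employee_updates u ON e.id = u.id
WHEN MATCHED AND u.salary < 60000 THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET salary = u.salary
WHEN NOT MATCHED THEN
    DO NOTHING;
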
Improve MERGE performance with work_mem

PostgreSQL uses a temporary work area in memory, whose size is controlled by the work_mem parameter, for operations such as sorting, hashing, and grouping records during query execution. When an operation needs more memory than work_mem allows, it spills to temporary files on disk. Therefore, the performance of MERGE can be improved by increasing work_mem. The following is a simple example with a db.r5.large instance:

  1. Create two sample tables and insert some dummy data:
    postgres=> CREATE TABLE merge_target (a int, b int);
    CREATE TABLE
    postgres=> CREATE TABLE merge_source (a int, b int);
    CREATE TABLE
    postgres=> INSERT INTO merge_target SELECT i, i*10 FROM generate_series(1,10000000,2) i;
    INSERT 0 5000000
    postgres=> INSERT INTO merge_source SELECT i, i*10 FROM generate_series(1,10000000,1) i;
    INSERT 0 10000000
    postgres=> ANALYZE merge_source;
    ANALYZE
    postgres=> ANALYZE merge_target;
    ANALYZE
    
  2. Get the backend process ID of the connection and run the MERGE command:
    postgres=> SELECT pg_backend_pid();
     pg_backend_pid
    ----------------
               4561
    (1 row)
    
    Time: 44.389 ms
    postgres=>
    postgres=> EXPLAIN ANALYZE 
    MERGE INTO merge_target t USING merge_source s ON t.a = s.a
    WHEN MATCHED THEN
        UPDATE SET b = t.b + 1;
                                                                     QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------------
     Merge on merge_target t  (cost=159039.00..543327.57 rows=0 width=0) (actual time=18349.074..18349.077 rows=0 loops=1)
       Tuples: updated=5000000
       ->  Hash Join  (cost=159039.00..543327.57 rows=5000000 width=6) (actual time=1692.346..7818.920 rows=5000000 loops=1)
             Hash Cond: (s.a = t.a)
             ->  Seq Scan on merge_source s  (cost=0.00..144247.77 rows=9999977 width=4) (actual time=0.012..1406.858 rows=10000000 loops=1)
             ->  Hash  (cost=72124.00..72124.00 rows=5000000 width=10) (actual time=1691.559..1691.560 rows=5000000 loops=1)
                   Buckets: 262144  Batches: 64  Memory Usage: 5422kB
                   ->  Seq Scan on merge_target t  (cost=0.00..72124.00 rows=5000000 width=10) (actual time=0.007..691.606 rows=5000000 loops=1)
     Planning Time: 0.159 ms
     Execution Time: 18349.114 ms
    (10 rows)
    

    In this example, it took 18.3 seconds to process the MERGE command.

  3. Download the recent PostgreSQL log file and check for temporary files created by the process ID:
    % grep "pgsql_tmp4561" postgresql.log.2023-08-09-22
    2023-08-09 22:31:44 UTC:72.182.215.11(54319):postgres@postgres:[4561]:LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4561.0", size 70000000
    2023-08-09 22:31:55 UTC:72.182.215.11(54319):postgres@postgres:[4561]:LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4561.1", size 140000000
    2023-08-09 22:32:02 UTC:72.182.215.11(54319):postgres@postgres:[4561]:LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4561.38", size 2500672
    2023-08-09 22:32:02 UTC:72.182.215.11(54319):postgres@postgres:[4561]:LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4561.94", size 3752496
    2023-08-09 22:32:02 UTC:72.182.215.11(54319):postgres@postgres:[4561]:LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4561.20", size 2497920
    .
    .
    
     % grep "pgsql_tmp4561" postgresql.log.2023-08-09-22|wc -l
         128
    
     % grep "pgsql_tmp4561" postgresql.log.2023-08-09-22|awk -F' ' '{ sum += $9/1024/1024 } END { print sum }'
    575.754
    

    The preceding logs show that running the MERGE command created 128 temporary files totaling about 575 MB.

  4. You can increase the work_mem at the session level and try the MERGE command:
    postgres=> SELECT pg_backend_pid();
     pg_backend_pid
    ----------------
               4639
    (1 row)
    
    Time: 43.541 ms
    postgres=> SET work_mem TO '512MB';
    SET
    Time: 43.861 ms
    
    postgres=> EXPLAIN ANALYZE MERGE INTO merge_target t USING merge_source s ON t.a = s.a
    WHEN MATCHED THEN
       UPDATE SET b = t.b + 1;
                                                                        QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------------
     Merge on merge_target t  (cost=156748.00..438496.00 rows=0 width=0) (actual time=14787.835..14787.838 rows=0 loops=1)
       Tuples: updated=5000000
       ->  Hash Join  (cost=156748.00..438496.00 rows=5000000 width=6) (actual time=1631.768..7072.285 rows=5000000 loops=1)
             Hash Cond: (s.a = t.a)
             ->  Seq Scan on merge_source s  (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.005..745.117 rows=10000000 loops=1)
             ->  Hash  (cost=94248.00..94248.00 rows=5000000 width=10) (actual time=1622.319..1622.320 rows=5000000 loops=1)
                   Buckets: 8388608  Batches: 1  Memory Usage: 280380kB
                   ->  Seq Scan on merge_target t  (cost=0.00..94248.00 rows=5000000 width=10) (actual time=0.024..496.066 rows=5000000 loops=1)
     Planning Time: 0.104 ms
     Execution Time: 14796.642 ms
    (10 rows)
    
    % grep "pgsql_tmp4639" postgresql.log.2023-08-09-22|wc -l
           0
     %
    

    As you can see, it took 14.7 seconds, compared to 18.3 seconds without setting work_mem (a 20% improvement), and no temporary files were created because the operation was performed entirely in memory.

Accelerate MERGE queries with Amazon RDS Optimized Reads

When the working memory is too small for the operation, PostgreSQL spills to a temporary work area on disk instead of memory, which degrades the performance of MERGE. However, configuring the working memory to a high value may not necessarily be the optimal solution either.

To deal with these kinds of situations, Amazon RDS introduced a new deployment option called Optimized Reads, which provides up to twice the read performance for workloads that rely heavily on a temporary work area. Optimized Reads uses the local NVMe SSD block-level storage available on supported instance classes (such as db.r6gd). Refer to Introducing Optimized Reads for Amazon RDS for PostgreSQL for more details.

Let’s try the same example with a db.r6gd.large instance type, which offers local NVMe-based SSD block-level storage:

Note: We run this query with the default work_mem (4 MB) because we want to measure the improvement provided by the Optimized Reads feature alone, without changing the work_mem default.

postgres=> SELECT pg_backend_pid();
 pg_backend_pid
----------------
            727
(1 row)


postgres=> EXPLAIN ANALYZE  
MERGE INTO merge_target t USING merge_source s ON t.a = s.a
WHEN MATCHED THEN
   UPDATE SET b = t.b + 1;

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Merge on merge_target t  (cost=181025.25..565239.14 rows=0 width=0) (actual time=15767.081..15767.084 rows=0 loops=1)
   Tuples: updated=5000000
   ->  Hash Join  (cost=181025.25..565239.14 rows=4994989 width=6) (actual time=1332.803..7212.402 rows=5000000 loops=1)
         Hash Cond: (s.a = t.a)
         ->  Seq Scan on merge_source s  (cost=0.00..144248.00 rows=10000000 width=4) (actual time=0.008..1041.362 rows=10000000 loops=1)
         ->  Hash  (cost=94197.89..94197.89 rows=4994989 width=10) (actual time=1331.423..1331.424 rows=5000000 loops=1)
               Buckets: 262144  Batches: 64  Memory Usage: 5422kB
               ->  Seq Scan on merge_target t  (cost=0.00..94197.89 rows=4994989 width=10) (actual time=7.399..551.330 rows=5000000 loops=1)
 Planning Time: 0.353 ms
 Execution Time: 15767.139 ms
(10 rows)

% grep "pgsql_tmp727" postgresql.log.2023-08-09-23|wc -l
     128
% grep "pgsql_tmp727" postgresql.log.2023-08-09-23|awk -F' ' '{ sum += $9/1024/1024 } END { print sum }'
575.754

With the default work_mem setting of 4 MB, the db.r6gd.large instance took 15.7 seconds, compared to 18.3 seconds with a db.r5.large instance (a 15% improvement), with the same temporary file usage (128 files, about 575 MB in total).

WAL compression (zstd and lz4)

PostgreSQL 15 includes improvements to Write-Ahead Logging (WAL) compression, with the introduction of the zstd and lz4 compression algorithms. This allows for more efficient logging of transactions, resulting in improved performance, reduced disk space usage, and faster recovery.

The lz4 and zstd WAL compression methods are supported in Amazon RDS for PostgreSQL 15, and the default WAL compression is zstd. Due to its unique distributed storage system, Amazon Aurora PostgreSQL version 15 does not support server-side compression with Gzip, LZ4, or Zstandard (zstd) using pg_basebackup, online backups using pg_backup_start() and pg_backup_stop(), or prefetching during WAL recovery.

The zstd algorithm is a high-performance data compression algorithm, similar to gzip or bzip2, but often with better compression ratios and faster compression speeds. It’s designed for applications that require high write throughput. The lz4 algorithm is a high-speed data compression algorithm, designed to provide faster speeds than gzip or bzip2, while still providing good compression ratios. It’s commonly used for applications that require low read latency.

To use WAL compression in PostgreSQL 15, you can set the wal_compression parameter. The supported methods are pglz, lz4 (if PostgreSQL was compiled with --with-lz4), and zstd (if PostgreSQL was compiled with --with-zstd). The default value for PostgreSQL 15 is off. See the following code:

postgres=> show wal_compression ;
 wal_compression
-----------------
 off
(1 row)

postgres=> SET wal_compression TO on;
SET
postgres=> show wal_compression ;
 wal_compression
-----------------
 pglz
(1 row)

In Amazon RDS, it is enabled by default with the zstd option. You can see the performance difference between each available WAL compression method through benchmarking.
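For example, on a self-managed PostgreSQL 15 installation, a superuser can switch the compression method at the session level; the following is a minimal sketch (on Amazon RDS and Aurora, you would normally change wal_compression through the DB parameter group or DB cluster parameter group instead):

postgres=# SET wal_compression TO zstd;
SET
postgres=# SHOW wal_compression;
 wal_compression
-----------------
 zstd
(1 row)
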

The following graph provides a representation of transactions per second (TPS) as they correspond to varying numbers of connections, each associated with distinct WAL compression methods.

The following graph displays the size of WAL files relative to the number of connections for each respective WAL compression method.

The result shows a significant performance improvement in terms of transactions per second (up to 41% improvement) and reduction in size of WAL generated (up to 79% improvement).

Enhancements in logical replication

Logical replication is a technique of replicating data objects and related changes or transactions at a logical level (based on replication identity). Unlike physical replication, which replicates the entire database cluster, logical replication allows for selective replication of data.

In logical replication, changes made to a source table are captured as a stream of logical changes and then replayed on a target table. This allows for real-time replication of changes, while also allowing for more fine-grained control over what data is replicated.

The following are some use cases for logical replication:

  • Real-time replication of data between geographically dispersed databases
  • Replication of data between different PostgreSQL versions
  • Replication of specific columns or rows within a table
  • Synchronization of multiple databases to maintain data consistency

In previous versions of PostgreSQL, logical replication was limited to replicating entire tables. However, in PostgreSQL 15, logical replication has been enhanced to allow for more granular replication, including the ability to replicate specific columns within a table.

The following are the steps to look at column-level replication:

  1. Set up logical replication in your Aurora PostgreSQL or RDS for PostgreSQL databases. For instructions, refer to Setting up logical replication for your Aurora PostgreSQL DB cluster or Performing logical replication for Amazon RDS for PostgreSQL, respectively.

    For the examples in this post, we use Aurora PostgreSQL databases as source and target.

  2. Connect to the source Aurora PostgreSQL instance and create a source database, table, and publication as follows:
    -- Create the source database and table
    CREATE DATABASE source_db;
    \c source_db;
    CREATE TABLE my_table (
        id SERIAL PRIMARY KEY,
        col1 TEXT NOT NULL,
        col2 TEXT NOT NULL,
        col3 TEXT NOT NULL);
    -- Create the publication
    CREATE PUBLICATION test_pub FOR TABLE my_table (col1, col2);
    

    In this example, we create a publication for the my_table table but only replicate the col1 and col2 columns. We then create a subscription to the publication on a target database.

  3. Connect to the target Aurora PostgreSQL instance and create a target database, table with the selected columns, and subscription as follows:
    -- Create the target database and table
    CREATE DATABASE target_db;
    \c target_db;
    CREATE TABLE my_table (
        id SERIAL PRIMARY KEY,
        col1 TEXT NOT NULL,
        col2 TEXT NOT NULL
    );
    -- Create the subscription
    CREATE SUBSCRIPTION test_sub
        CONNECTION 'host=<source_host> port=5432 dbname=source_db user=postgres password=<password>'
    PUBLICATION test_pub;
    
  4. Insert some data into the source table and verify the target database:
    -- Check on the source
    source_db=> INSERT INTO my_table (col1, col2, col3) VALUES ('foo', 'bar', 'baz');
    INSERT 0 1
    source_db=> SELECT * FROM my_table;
     id | col1 | col2 | col3
    ----+------+------+------
    1 | foo  | bar  | baz
    (1 row)
    
    -- Check on the target
    target_db=> SELECT * FROM my_table;
     id | col1 | col2
    ----+------+------
      1 | foo  | bar
    (1 row)
    

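In addition to column lists, PostgreSQL 15 publications also support row filters, which let you replicate only the rows that match a WHERE condition. The following is a minimal sketch (the filter value is an illustrative assumption); note that if the publication replicates UPDATE or DELETE operations, the filter can only reference columns covered by the table's replica identity:

-- On the source database, publish only the rows where col1 = 'foo'
CREATE PUBLICATION filtered_pub FOR TABLE my_table WHERE (col1 = 'foo');
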
ALTER SUBSCRIPTION … SKIP

If incoming data violates any constraint, logical replication stops until the conflict is resolved. With the ALTER SUBSCRIPTION … SKIP command, the logical replication worker skips all data modification changes within the offending transaction, which you identify by its log sequence number (LSN). Note that this command requires superuser privileges.

Consider a duplicate transaction arriving on the subscription. The target database logs show the following entries:

2023-07-09 20:37:35.103 CDT [26803] ERROR:  duplicate key value violates unique constraint "uniq_my_table"
2023-07-09 20:37:35.103 CDT [26803] DETAIL:  Key (col1, col2)=(foo, bar) already exists.
2023-07-09 20:37:35.103 CDT [26803] CONTEXT:  processing remote data for replication origin "pg_16400" during message type "INSERT" for replication target relation "public.my_table" in transaction 19389, finished at 7/185A66A8
2023-07-09 20:37:35.104 CDT [25874] LOG:  background worker "logical replication worker" (PID 26803) exited with exit code 1

In PostgreSQL 15, with the ALTER SUBSCRIPTION … SKIP feature, you can skip a particular transaction using the LSN found in the logs:

target_db=# ALTER SUBSCRIPTION my_sub SKIP (lsn = '7/185A66A8');
ALTER SUBSCRIPTION

Now, the logs show that this transaction has been skipped:

2023-07-09 20:39:50.604 CDT [25874] LOG:  background worker "logical replication worker" (PID 27371) exited with exit code 1
2023-07-09 20:39:55.610 CDT [27376] LOG:  logical replication apply worker for subscription "my_sub" has started
2023-07-09 20:39:55.619 CDT [27376] LOG:  logical replication starts skipping transaction at LSN 7/185A66A8
2023-07-09 20:39:55.619 CDT [27376] CONTEXT:  processing remote data for replication origin "pg_16400" during message type "BEGIN" in transaction 19389, finished at 7/185A66A8
2023-07-09 20:39:55.619 CDT [27376] LOG:  logical replication completed skipping transaction at LSN 7/185A66A8
2023-07-09 20:39:55.619 CDT [27376] CONTEXT:  processing remote data for replication origin "pg_16400" during message type "COMMIT" in transaction 19389, finished at 7/185A66A8

Automatically disable logical replication on error

By default, when a conflict arises during logical replication, the worker enters an error loop, repeatedly attempting to apply the same change in the background after restarting. However, a new option (disable_on_error) allows the subscription worker to automatically disable the subscription upon encountering an error, breaking the loop. This empowers users to make informed decisions on how to proceed. Furthermore, during the initial table synchronization, any failure will also disable the subscription. The default value for this option is false, which means the worker keeps retrying the same failing change when a conflict occurs. The following command sets the option:

ALTER SUBSCRIPTION my_sub SET (disable_on_error = true);

You can re-enable the subscription after the issue is resolved:

ALTER SUBSCRIPTION my_sub ENABLE;
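You can check the subscription state on the subscriber by querying the pg_subscription catalog, which in PostgreSQL 15 includes a column that reflects the disable_on_error setting (a quick check, assuming the subscription is named my_sub):

-- subenabled shows whether the subscription is currently enabled;
-- subdisableonerr shows whether disable_on_error is set
SELECT subname, subenabled, subdisableonerr
FROM pg_subscription
WHERE subname = 'my_sub';
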

Publish all tables in a schema

Previously, you could only publish all tables in a database. However, PostgreSQL 15 allows you to publish all the tables in a specific schema:

CREATE PUBLICATION my_pub FOR ALL TABLES IN SCHEMA <schema_name>;

SELECT DISTINCT

In PostgreSQL 15, the SELECT DISTINCT query can be parallelized using a two-phase distinct process. In phase 1, parallel workers make rows distinct by either hashing or sorting them. The results from the parallel workers are then combined, and the final distinct phase is performed serially to remove any duplicate rows that appear due to combining rows for each of the parallel workers. This allows PostgreSQL to improve the performance of SELECT DISTINCT queries on large datasets by splitting the work across multiple CPU cores.

Let’s look at the following example in PostgreSQL 14 and PostgreSQL 15.

In PostgreSQL 14, we use the following code:

CREATE TABLE select_distinct (id INT);
INSERT INTO select_distinct VALUES (generate_series(1,3000000));

postgres=> explain analyze select distinct id from select_distinct;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=204525.00..257962.50 rows=3000000 width=4) (actual time=1301.435..2392.225 rows=3000000 loops=1)
   Group Key: id
   Planned Partitions: 128  Batches: 129  Memory Usage: 4113kB  Disk Usage: 122464kB
   ->  Seq Scan on select_distinct  (cost=0.00..43275.00 rows=3000000 width=4) (actual time=0.015..349.542 rows=3000000 loops=1)
 Planning Time: 0.153 ms
 Execution Time: 2645.889 ms
(6 rows)

You can see the normal Seq Scan on the table for the DISTINCT query.

In PostgreSQL 15, we use the following code:

CREATE TABLE select_distinct (id INT);
INSERT INTO select_distinct VALUES (generate_series(1,3000000));
postgres=> EXPLAIN ANALYZE SELECT DISTINCT id FROM select_distinct;
                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=169449.88..211349.32 rows=3000000 width=4) (actual time=222.888..1151.628 rows=3000000 loops=1)
   ->  Gather Merge  (cost=169449.88..203849.32 rows=3000000 width=4) (actual time=222.887..730.521 rows=3000000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=169449.85..172574.85 rows=1250000 width=4) (actual time=217.158..302.022 rows=1000000 loops=3)
               Sort Key: id
               Sort Method: external merge  Disk: 12328kB
               Worker 0:  Sort Method: external merge  Disk: 11480kB
               Worker 1:  Sort Method: external merge  Disk: 11488kB
               ->  Parallel Seq Scan on select_distinct  (cost=0.00..25775.00 rows=1250000 width=4) (actual time=0.012..79.246 rows=1000000 loops=3)
 Planning Time: 0.443 ms
 Execution Time: 1299.492 ms
(12 rows)

You can see the Parallel Seq Scan on the table for the DISTINCT query.

Improved performance for sorts that exceed work_mem

In version 14, PostgreSQL uses a polyphase merge algorithm for sorting. However, in PostgreSQL 15, this algorithm is replaced with a simple balanced k-way merge algorithm. Due to this, sorting on a single column is improved in PostgreSQL 15.

Let’s look at the following example in PostgreSQL 14 and PostgreSQL 15.

In PostgreSQL 14, use the following code:

CREATE TABLE test_sort (id INT, name VARCHAR);
INSERT INTO test_sort VALUES (generate_series(1,3000000),'test'||generate_series(1,3000000));

postgres=> EXPLAIN ANALYZE SELECT id FROM test_sort order by id;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=450997.97..458497.97 rows=3000000 width=4) (actual time=1288.173..1678.183 rows=3000000 loops=1)
   Sort Key: id
   Sort Method: external merge Disk: 41176kB
   ->  Seq Scan on test_sort  (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.008..413.279 rows=3000000 loops=1)
 Planning Time: 0.164 ms
 Execution Time: 1912.809 ms
(6 rows)

The query took 1.9 seconds, and the sort spilled to disk, using 41176 KB.

In PostgreSQL 15, use the following code:

CREATE TABLE test_sort (id INT, name VARCHAR);
INSERT INTO test_sort VALUES (generate_series(1,3000000),'test'||generate_series(1,3000000));

postgres=> EXPLAIN ANALYZE SELECT id FROM test_sort ORDER BY id;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=450997.97..458497.97 rows=3000000 width=4) (actual time=1098.836..1447.813 rows=3000000 loops=1)
   Sort Key: id
   Sort Method: external merge  Disk: 35280kB
   ->  Seq Scan on test_sort  (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.015..429.089 rows=3000000 loops=1)
 Planning Time: 1.926 ms
 Execution Time: 1679.625 ms
(6 rows)

The query took 1.6 seconds, and the sort spilled to disk, using 35280 KB, an improvement of approximately 12.18% in query runtime.

Let’s consider Amazon RDS Optimized Reads (using a db.r6gd.large instance) for the same example:

CREATE TABLE test_sort (id INT, name VARCHAR);
INSERT INTO test_sort VALUES (generate_series(1,3000000),'test'||generate_series(1,3000000));

postgres=> EXPLAIN ANALYZE SELECT id FROM test_sort ORDER BY id;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=450997.97..458497.97 rows=3000000 width=4) (actual time=770.792..1038.945 rows=3000000 loops=1)
   Sort Key: id
   Sort Method: external merge  Disk: 35280kB
   ->  Seq Scan on test_sort  (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.007..309.724 rows=3000000 loops=1)
 Planning Time: 0.157 ms
 Execution Time: 1178.586 ms
(6 rows)

With the same disk sort (35280 KB), the runtime is 1.1 seconds, which is another 30% improvement.

PUBLIC schema permissions and ownership

PostgreSQL 15 has introduced several new security enhancements, one of which is the removal of the PUBLIC creation permission on the public schema. The public schema is the default schema that is created when a new database is created. This schema is owned by the default user, usually the installation user, and is used to store all objects that are created without specifying a schema. This change is designed to improve the security of PostgreSQL databases by restricting access to the public schema.

In previous versions of PostgreSQL (14 and earlier), the public schema was created automatically and was accessible to all users by default. This meant that by default, users with access to this database could create objects in the public schema, which under certain conditions could allow users to create attacks, as specified in CVE-2018-1058.

To mitigate this issue, PostgreSQL 15 removes the ability for users to create objects in the public schema by default. Instead, users must explicitly grant the CREATE permission on the public schema to other users or roles if they wish to allow them to create objects in the public schema.

Note that this change could impact people upgrading, particularly those who have existing database bootstrapping scripts that rely on the previous behavior.

In PostgreSQL 14, use the following code:

postgres=> CREATE USER test_public WITH PASSWORD 'test';
CREATE ROLE
Time: 11.779 ms
postgres=> \c - test_public
psql (15.2, server 14.5)
You are now connected to database "postgres" as user "test_public".
postgres=> CREATE TABLE test_public(id INT);
CREATE TABLE
Time: 26.138 ms
postgres=>

As you can see, the new user has permissions on the public schema by default.

In PostgreSQL 15, use the following code:

postgres=> CREATE USER test_public WITH PASSWORD 'test';
CREATE ROLE
Time: 9.804 ms
postgres=> \c - test_public
You are now connected to database "postgres" as user "test_public".
postgres=> CREATE TABLE test_public(id INT);
ERROR:  permission denied for schema public
LINE 1: create table test_public(id int);
                     ^
Time: 1.792 ms
postgres=>
postgres=> \c - postgres
You are now connected to database "postgres" as user "postgres".
postgres=>
postgres=> GRANT CREATE ON SCHEMA public TO test_public;
GRANT
Time: 2.729 ms
postgres=> \c - test_public
You are now connected to database "postgres" as user "test_public".
postgres=> CREATE TABLE test_public(id INT);
CREATE TABLE
Time: 14.758 ms
postgres=>

PostgreSQL 15 doesn’t allow new users to create objects in the public schema by default. You would need to explicitly grant CREATE privileges to specific users on the public schema.

The following code is an example of how to remove PUBLIC creation permission on the public schema prior to PostgreSQL 15:

-- Revoke creation permission on the public schema from PUBLIC
REVOKE CREATE ON SCHEMA public FROM PUBLIC;

Prior to PostgreSQL 15, the public schema in every database was owned by default by the bootstrap superuser (typically postgres). Ownership is normally given to a trusted user, but this arrangement left room for trojan-style attacks that could subvert the database owner or superuser (CVE-2018-1058). The ownership change in PostgreSQL 15 is a risk mitigation feature that helps remove that attack vector.

To address this issue, the public schema is owned by the new pg_database_owner role in PostgreSQL 15 by default. The role is automatically granted to the database owner when a new database is created.

In PostgreSQL 14, use the following code:

postgres=> \dn public
  List of schemas
  Name  |  Owner
--------+----------
 public | postgres
(1 row)

In PostgreSQL 15, use the following code:

postgres=> \dn public
      List of schemas
  Name  |       Owner
--------+-------------------
 public | pg_database_owner
(1 row)

To change the owner of the public schema to the pg_database_owner role prior to PostgreSQL 15 (the role is available starting in PostgreSQL 14), run the following SQL command:

ALTER SCHEMA public OWNER TO pg_database_owner;

Security invoker views

PostgreSQL 15 includes a new feature that allows you to create views with the security invoker option. Security invoker views are run with the permissions of the user accessing the view, not the creator. This provides an extra layer of security and control over the data that is returned by the view.

To create a view with the security invoker option, set the security_invoker option in the WITH clause of the CREATE VIEW statement, as shown in the following example:

CREATE VIEW test_view WITH (security_invoker = true) AS
SELECT * FROM test_table;
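
To see the effect, consider a minimal sketch with two hypothetical roles, app_owner and app_reader (both names are illustrative). Because the view runs with the caller's privileges, app_reader needs privileges on the underlying table as well:

-- As app_owner (the owner of test_table and test_view), grant access to the view only
GRANT SELECT ON test_view TO app_reader;

-- As app_reader: this fails with "permission denied for table test_table",
-- because the view runs with the invoker's privileges; it succeeds only after
-- app_reader is also granted SELECT on test_table
SELECT * FROM test_view;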

UNIQUE NULLS NOT DISTINCT

PostgreSQL 14 and older versions treat NULL values as distinct from (not equal to) other NULL values. This is in line with the SQL standard, because NULL is an unknown value and comparing one unknown to another unknown cannot yield equality. As a result, in earlier PostgreSQL versions, a UNIQUE constraint could still accept multiple rows with NULL values.

In PostgreSQL 14, use the following code:

postgres=# 
CREATE TABLE earlier_version (
id integer primary key, 
name varchar NOT NULL, 
dept varchar default NULL, 
CONSTRAINT uniq_name_dept UNIQUE (name,dept));
CREATE TABLE


postgres=> INSERT INTO earlier_version (id,name,dept) SELECT generate_series(1,6),'abc',NULL;
INSERT 0 6


postgres=> SELECT * FROM earlier_version ;
 id | name | dept
----+------+------
  1 | abc  | NULL
  2 | abc  | NULL
  3 | abc  | NULL
  4 | abc  | NULL
  5 | abc  | NULL
  6 | abc  | NULL
(6 rows)

As you can see in the preceding example, although there is a unique constraint created on the name and dept columns, we can still add repeated NULL values in the dept column.

With the new option, the unique constraint defined on the latest_version table becomes more restrictive.

In PostgreSQL 15, use the following code:

postgres=> CREATE TABLE latest_version (
id integer primary key, name varchar NOT NULL, 
dept varchar default NULL, 
CONSTRAINT new_uniq_name_dept UNIQUE NULLS NOT DISTINCT (name,dept));
CREATE TABLE

postgres=> INSERT INTO latest_version (id,name,dept) SELECT 1,'abc',NULL;
INSERT 0 1
postgres=> SELECT * FROM latest_version ;
 id | name | dept
----+------+------
  1 | abc  | NULL
(1 row)

The row is added successfully. Now, if we try to add another row in this table with NULL in the dept column, it throws a unique constraint violation error:

postgres=> INSERT INTO latest_version (id,name,dept) SELECT 2,'abc',NULL;
ERROR:  duplicate key value violates unique constraint "new_uniq_name_dept"
DETAIL:  Key (name, dept)=(abc, null) already exists.

The UNIQUE NULLS NOT DISTINCT option adds a more granular level of control over how NULLs are handled in unique constraints. It is an optional clause available starting in PostgreSQL 15, and the default behavior remains the same as in PostgreSQL 14 and earlier.

Conclusion

PostgreSQL 15 introduces impactful features and enhancements that make it more straightforward to manage complex data workloads while improving security and performance. You can get started using PostgreSQL 15 on Aurora and Amazon RDS, and use features such as Amazon RDS Optimized Reads to further accelerate performance with PostgreSQL 15 features.

Take advantage of the great new features in PostgreSQL version 15 and experience the potential for even more innovation in future releases. Join the community mailing lists and stay up to date with the latest developments in one of the most powerful and reliable database management systems available.


About the Authors

Baji Shaik is a Sr. Lead Consultant with AWS Professional Services, Global Competency Centre. His background spans a wide depth and breadth of expertise and experience in SQL and NoSQL database technologies. He is a Database Migration Expert and has developed many successful database solutions addressing challenging business requirements for moving databases from on premises to Amazon RDS and Amazon Aurora PostgreSQL/MySQL. He is an eminent author, having written several books on PostgreSQL. A few of his recent works include “PostgreSQL Configuration,” “Beginning PostgreSQL on the Cloud,” and “PostgreSQL Development Essentials.” Furthermore, he has delivered several conference and workshop sessions.

Jonathan Katz is a Principal Product Manager – Technical on the Amazon RDS team and is based in New York. He is a Core Team member of the open-source PostgreSQL project and an active open-source contributor.

Swanand Kshirsagar is a Lead Consultant within the Professional Services division at Amazon Web Services. He specializes in collaborating with clients to architect and implement scalable, robust, and security-compliant solutions within the AWS Cloud environment. His primary expertise lies in orchestrating seamless migrations, encompassing both homogenous and heterogeneous transitions, facilitating the relocation of on-premises databases to Amazon RDS and Amazon Aurora PostgreSQL with efficiency.