AWS Database Blog
Synopsis of several compelling features in PostgreSQL 16
PostgreSQL 16, released in September 2023, brings performance improvements and many new features to the popular open source relational database. These features help in all aspects of how you manage your data: speeding up ingestion, providing more options for distributing your data, and offering more insight into how your queries are performing. You can try out PostgreSQL 16 on both Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL.
In this post, we explore the new features in PostgreSQL 16 and discuss how they improve performance and query speed. These include new replication features such as logical decoding on standbys and parallel apply of logical replication, SQL/JSON functionality, new monitoring tools such as the pg_stat_io system view, and security features.
Prerequisites
To follow along with this post, complete the following prerequisites:
- Create an Amazon Aurora PostgreSQL-compatible cluster or RDS for PostgreSQL (v16) instance if you don’t already have one. For instructions, refer to Create an Amazon Aurora PostgreSQL-Compatible DB cluster or Create a PostgreSQL DB instance, respectively.
- Create an Amazon Elastic Compute Cloud (Amazon EC2) instance to install the PostgreSQL client to access the Aurora PostgreSQL or RDS for PostgreSQL instance. For instructions, refer to Create your EC2 resources and launch your EC2 instance. Or you can set up connectivity between your RDS database and EC2 compute instance in one click.
- Install the PostgreSQL client. On Amazon Linux 2023, you can use the following commands to download the psql command line tool:
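The original commands aren't reproduced here; the following is a minimal sketch for Amazon Linux 2023 (the package name may differ depending on the repository version):

```bash
# Install the PostgreSQL 16 client tools (assumed package name on Amazon Linux 2023)
sudo dnf install -y postgresql16

# Verify that the psql client is available
psql --version
```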
New features and enhancements to logical replication
Logical replication is a mechanism for replicating data objects, modifications, or transactions at the logical level (based on replica identity). Unlike physical replication, which copies the whole database cluster, logical replication lets you replicate data selectively. Changes made to a source table are recorded as a stream of logical changes and then replayed on a destination table. This enables real-time replication of changes while also providing finer-grained control over what data is replicated.
Parallel apply of large transactions
In previous versions, for large transactions, the publisher sends the data in multiple streams, and on the subscriber, an apply worker process collects the data in a temporary file. After receiving a commit, the worker reads from the temporary file and applies the entire transaction to the target table, which leads to a replication delay.
In PostgreSQL 16, a new parallel apply option is introduced, which writes data directly to the target table using multiple background worker processes. This helps reduce latency. You can enable this option when creating the subscription. Along with this, the substream column of the catalog table pg_subscription changed from type boolean to char to show whether a subscription applies in-progress transactions in parallel: if the substream value is p, the parallel option is enabled; if it is f, streaming of in-progress transactions is disallowed.
The parameter max_parallel_apply_workers_per_subscription helps control the number of worker processes for the created subscription; the default value is 2. For this post, we performed a small test case with and without the parallel option to load the data.
Example: Replicating a transaction with 10 million rows
For this example, we set up the RDS databases, enable logical replication (set the rds.logical_replication parameter to 1 in the parameter group), and set up replication through the pub/sub method.
We use the following code for the publication server:
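The original code isn't shown here; the following is a minimal sketch, with the table and publication names assumed for illustration:

```sql
-- On the publisher: create the source table and publish it
CREATE TABLE data_load (
    id      bigint PRIMARY KEY,
    payload text
);

CREATE PUBLICATION pub_data_load FOR TABLE data_load;
```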
And the following code for the subscription server:
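A corresponding sketch for the subscriber, using the same assumed names (replace the connection parameters with your own values); no streaming option is set, so substream defaults to f:

```sql
-- On the subscriber: create a matching target table and subscribe to the publication
CREATE TABLE data_load (
    id      bigint PRIMARY KEY,
    payload text
);

CREATE SUBSCRIPTION sub_data_load
    CONNECTION 'host=<publisher-endpoint> port=5432 dbname=postgres user=repluser password=<password>'
    PUBLICATION pub_data_load;
```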
We use the following code to insert data into the source table, which took about 1 minute to complete:
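A sketch of the data load, assuming the table above:

```sql
-- Insert 10 million rows into the source table on the publisher
INSERT INTO data_load (id, payload)
SELECT g, md5(g::text)
FROM generate_series(1, 10000000) AS g;
```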
As soon as data is inserted into the primary table, we used the watch command to continuously monitor the time and row count of the target table. The following output shows how long it took to load the data without the parallel option:
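One way to do this from psql (a sketch, assuming the table name above):

```sql
-- On the subscriber, re-run the row count every 5 seconds
SELECT now(), count(*) FROM data_load;
\watch 5
```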
Without the parallel option, it took around 2.5 minutes to replicate the data. Also, we captured the logical replication processes from the enhanced monitoring OS process list.
The following is the output of pg_subscription, where the substream value f indicates that the parallel option was not enabled:
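You can check this with a query such as the following:

```sql
SELECT subname, substream FROM pg_subscription;
```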
Let’s run the same test with parallel apply enabled
We use the following code to set up the publication server:
We use the following code to set up the subscription server:
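A sketch of the subscription with parallel apply enabled (the earlier subscription is assumed to have been dropped, and the same placeholder connection values apply):

```sql
-- On the subscriber: subscribe with parallel apply of streamed in-progress transactions
CREATE SUBSCRIPTION sub_data_load_parallel
    CONNECTION 'host=<publisher-endpoint> port=5432 dbname=postgres user=repluser password=<password>'
    PUBLICATION pub_data_load
    WITH (streaming = parallel);
```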
We inserted the data into the source table, which took approximately 1 minute to complete:
As soon as data is inserted into the primary table, we started monitoring the time and row count of the target table. The following output shows how much time it took to load the data with the parallel option:
With the parallel option, it took around 1 minute to replicate the data. Also, we captured the logical replication processes from the enhanced monitoring OS process list and observed two parallel apply workers running.
Depending on the available resources on the instance, we can further accelerate applying the changes to the target table with additional parallel workers.
The following is the output of pg_subscription, where the substream value p indicates that the parallel option is enabled:
The parallel option helps you apply the data to the target table in less time, but a greater number of workers may impact database performance.
Logical replication from a standby instance
Before PostgreSQL 16, logical replication was limited to replicating data only from the primary instance. However, in PostgreSQL 16, you can now replicate data from a standby instance, which can help reduce the worker overhead from the primary instance.
The following figure illustrates three databases running on different ports. Database A is the primary read/write database, which is running on port 5432. The hot standby database B is running on port 5433, and database C is running on port 5434. Connectivity between B to C is set up using logical replication.
Create an RDS for PostgreSQL v16 cluster with a primary and a standby instance. Create another RDS for PostgreSQL v16 instance in the same VPC to act as the logical replication target. Then complete the following steps:
Use the following code to create two tables in the primary database (A):
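A sketch of the tables (names assumed; the r1 table is referenced later in this example):

```sql
-- On the primary database (A): create two tables to replicate
CREATE TABLE r1 (
    id   bigint PRIMARY KEY,
    note text
);

CREATE TABLE r2 (
    id   bigint PRIMARY KEY,
    note text
);
```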
Create a user for replication:
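A sketch of the replication user; on Amazon RDS and Aurora, membership in the rds_replication role grants the required replication permissions:

```sql
CREATE ROLE repluser WITH LOGIN PASSWORD '<password>';
GRANT rds_replication TO repluser;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO repluser;
```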
Create the replica and then create a publication in database A:
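A sketch of the publication (name assumed):

```sql
-- On database A: publish the tables
CREATE PUBLICATION pub_standby_test FOR TABLE r1, r2;
```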
Take the database schema dump using pg_dump and then restore it in database C using the psql -f command. After the schema is restored, create a subscription in database C using the following command and provide the port of database B (the hot standby):
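A sketch of the subscription, assuming the names above and that database B listens on port 5433:

```sql
-- On database C: subscribe to the publication, connecting to the hot standby (B)
CREATE SUBSCRIPTION sub_from_standby
    CONNECTION 'host=<standby-endpoint> port=5433 dbname=postgres user=repluser password=<password>'
    PUBLICATION pub_standby_test;
```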
Use the following code to check the replication status in database B:
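For example (a sketch; the column lists are trimmed for readability):

```sql
-- On database B (the hot standby): confirm the logical slot and the walsender serving it
SELECT slot_name, slot_type, active FROM pg_replication_slots;
SELECT application_name, state FROM pg_stat_replication;
```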
Load data into the r1 table in database A and check the data in database C:
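A sketch of the load and the verification:

```sql
-- On database A: load data into r1
INSERT INTO r1 (id, note)
SELECT g, 'row ' || g
FROM generate_series(1, 100000) AS g;

-- On database C: verify that the rows arrived through the standby
SELECT count(*) FROM r1;
```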
We can see the data is replicating from the hot standby. This feature can help you avoid the load on the primary database.
New SQL/JSON functionality
PostgreSQL has supported JSON functionality since version 9.2, released in 2012. This support includes storing JSON data and enabling users to build complex queries on JSON data using SQL. Over the years, PostgreSQL has enhanced its JSON capabilities, introducing JSONB in version 9.4, which stores JSON data in a binary format for faster searches and greater efficiency.
PostgreSQL 16 introduces a set of features related to JSON that aim to make the engine more compliant with the SQL standard. One of these features is support for the IS JSON predicate.
The IS JSON predicate allows you to check the validity of a JSON object based on its structure. It includes several tests, such as IS JSON VALUE, IS JSON ARRAY, IS JSON OBJECT, and IS JSON SCALAR. These tests operate on text and bytea values representing JSON, as well as on the json and jsonb types.
The IS JSON ARRAY test validates all valid JSON arrays, including arrays with various types of elements. The IS JSON OBJECT test verifies JSON values enclosed in curly brackets, whereas the IS JSON SCALAR test applies to single JSON values such as null, integers, strings, and booleans. The IS JSON VALUE test, with VALUE as an optional keyword, checks whether a value can be parsed as JSON at all, including single values, arrays, JSON objects, and empty objects.
Let's look at an example of performing IS JSON tests.
Create a test table and insert some data:
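A sketch of such a table (names and sample values assumed):

```sql
-- A plain text column holding values that may or may not be valid JSON
CREATE TABLE test_json (js text);

INSERT INTO test_json VALUES
    ('123'),                                   -- scalar
    ('"abc"'),                                 -- scalar (string)
    ('[1, 2, "three"]'),                       -- array
    ('{"name": "postgres", "version": 16}'),   -- object
    ('not json at all');                       -- not valid JSON
```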
The following query demonstrates the IS JSON tests on various JSON values. It provides insights into whether each value is considered valid as JSON, a scalar, an object, or an array.
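A query along these lines, against the assumed table above:

```sql
SELECT js,
       js IS JSON        AS is_json,
       js IS JSON SCALAR AS is_scalar,
       js IS JSON OBJECT AS is_object,
       js IS JSON ARRAY  AS is_array
FROM test_json;
```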
It returns the following output:
Another important aspect is the WITH UNIQUE KEYS and WITHOUT UNIQUE KEYS clauses, which validate an object by checking for unique keys. These flags can be used with the IS JSON OBJECT and IS JSON ARRAY tests. If an object contains duplicate keys, it is considered invalid.
The following is a sample query to show the usage of the WITH UNIQUE KEYS and WITHOUT UNIQUE KEYS flags:
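For example, a sketch of such a check on an object with a duplicate key:

```sql
SELECT '{"a": 1, "a": 2}' IS JSON WITHOUT UNIQUE KEYS AS without_unique,
       '{"a": 1, "a": 2}' IS JSON WITH UNIQUE KEYS    AS with_unique;
-- without_unique is true; with_unique is false because the key "a" is duplicated
```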
It returns the following output:
Another feature related to JSON is support for the SQL/JSON standard-conforming constructors JSON_ARRAY(), JSON_ARRAYAGG(), JSON_OBJECT(), and JSON_OBJECTAGG(). These constructors allow for the skipping or inclusion of NULL values, and provide options to allow or disallow duplicate keys. Furthermore, the constructors follow the standard-specified syntax for specifying output type and format. With these enhancements, PostgreSQL 16 provides a more comprehensive and standardized approach to working with JSON data.
The JSON_ARRAY() constructor allows for the creation of JSON arrays, providing a way to organize and structure multiple values within a single JSON array. The JSON_ARRAYAGG() function aggregates multiple input values into a single JSON array, making it straightforward to combine and process data. The JSON_OBJECT() constructor enables the creation of JSON objects, which are key-value pairs that can represent structured data. The JSON_OBJECTAGG() function aggregates multiple key-value pairs into a single JSON object, simplifying the process of constructing complex JSON structures.
The following are some examples to show how to use each of these constructors:
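A few illustrative queries (sample values are assumptions):

```sql
-- Constructors that build a single JSON value from explicit arguments
SELECT JSON_ARRAY(1, 2, 'three');                           -- a three-element array
SELECT JSON_OBJECT('name': 'postgres', 'version': 16);      -- an object with two keys
SELECT JSON_OBJECT('a': 1, 'b': NULL ABSENT ON NULL);       -- the NULL value is skipped

-- Aggregate constructors that build one JSON value from a set of rows
SELECT JSON_ARRAYAGG(name)
FROM (VALUES ('red'), ('green'), ('blue')) AS t(name);      -- ["red", "green", "blue"]

SELECT JSON_OBJECTAGG(name: length(name))
FROM (VALUES ('red'), ('green'), ('blue')) AS t(name);      -- {"red": 3, "green": 5, "blue": 4}
```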
These new JSON features in PostgreSQL 16 enhance the engine’s compliance with the SQL standard and provide more robust validation capabilities for JSON data. They allow for more precise and efficient handling of JSON objects, arrays, and scalars, enabling developers to work with JSON data more effectively.
Speed up concurrent data ingestion with COPY
PostgreSQL 16 improves efficiency with a significant upgrade to its bulk extension feature for relations. The new method allows a relation to be extended by multiple blocks either when the caller needs multiple pages or when there’s contention on the extension lock. This reduces overhead and avoids I/O operations while a content lock is held, enhancing overall performance.
This improvement provides substantial benefits, with tests showing a significant increase in speed for concurrent COPY into a single relation. Even single-threaded COPY and single-row INSERT operations show measurable improvement due to the efficient use of the extension lock. This change marks a significant improvement in PostgreSQL’s handling of relation extensions, enhancing both performance and efficiency.
Let's look at a sample benchmark to check the performance benefits. This test was performed on PostgreSQL 15 and PostgreSQL 16 installed on a t2.large EC2 instance, and pgbench was used to run the tests.
Benchmark with pgbench
The performance test was conducted using pgbench, which is a benchmarking tool for PostgreSQL. The test was run with 16 concurrent users, for both single-threaded and multi-threaded configurations. The total number of transactions performed for each benchmark was set to 512, which is proportional to the number of users, so the same volume of data was processed in each run. The data used for this test was contained in small files approximately 9 MB in size, consisting of 100,000 rows of data for a single-column table.
Use the following command to generate the small file:
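The original command isn't reproduced here; one possible sketch (file name, location, and row contents are assumptions) writes 100,000 rows of roughly 90 bytes each, for a file of about 9 MB:

```bash
awk 'BEGIN {
  filler = sprintf("%080d", 0);                  # 80-character filler per row
  for (i = 1; i <= 100000; i++)
    printf "row_%06d_%s\n", i, filler;
}' > /location/to/copytest_data_small.copy
```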
Use the following pgbench command for a single-thread configuration:
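A sketch of the command (database name assumed; -t 32 per client gives 16 × 32 = 512 transactions in total, matching the table below):

```bash
pgbench -n -c 16 -j 1 -t 32 -f c_s.sql postgres
```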
Use the following pgbench command for a multi-thread configuration:
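The multi-threaded variant changes only the number of pgbench worker threads:

```bash
pgbench -n -c 16 -j 16 -t 32 -f c_s.sql postgres
```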
The command includes the following parameters:
- -c – Number of clients
- -j – Number of threads
- -t – Number of transactions
- -f – Refers to the c_s.sql file with the COPY command to load the copytest_0 table: COPY copytest_0 FROM '/location/to/copytest_data_small.copy';
The following table summarizes the results of the tests with PostgreSQL 15 and PostgreSQL 16.
| Version | Number of Clients | Number of Threads | Number of Transactions | Completion Time (seconds) | Table Size (after COPY) | Throughput (MB/s) |
|---|---|---|---|---|---|---|
| PostgreSQL 15 | 16 | 1 | 512 | 180 | 6095 MB | 33 MB/s |
| PostgreSQL 15 | 16 | 16 | 512 | 155 | 6095 MB | 39 MB/s |
| PostgreSQL 16 | 16 | 1 | 512 | 101 | 6095 MB | 60 MB/s |
| PostgreSQL 16 | 16 | 16 | 512 | 96 | 6095 MB | 63 MB/s |
From these results, with 16 clients in a single thread, PostgreSQL 16 is about 1.8 times faster (101 seconds, 60 MB/s throughput) than PostgreSQL 15 (180 seconds, 33 MB/s). Also, with 16 clients across 16 threads, PostgreSQL 16 is about 1.6 times faster (96 seconds, 63 MB/s throughput) than PostgreSQL 15 (155 seconds, 39 MB/s throughput).
Subtransaction performance improvements
PostgreSQL uses snapshots to maintain database ACID properties, providing a consistent view of data at a specific point. Each snapshot keeps track of active transaction identifiers and subtransaction identifiers. In PostgreSQL 16, the mechanism for looking up these identifiers in snapshots has been significantly enhanced. The system now uses SIMD (Single Instruction, Multiple Data) instructions, allowing multiple data points to be processed simultaneously. This greatly improves efficiency, particularly for operations on large datasets.
This enhancement has notable performance implications. When dealing with large arrays of transaction identifiers, the SIMD-accelerated lookups can significantly speed up operations, such as checking row visibility or determining transaction dependencies. This feature is especially beneficial in environments with a high number of concurrent writers, helping PostgreSQL maintain performance and responsiveness under heavy workloads.
Benchmark with heavy concurrent writes
Let's look at a benchmarking exercise focused on heavy concurrent writes, run on PostgreSQL 15 and PostgreSQL 16 installed on a t2.large EC2 instance. The benchmark uses pgbench and incorporates the following components:
- Single read client – A solitary client runs a SELECT query saved to an xid-select.sql file (see the sketch after this list). The target table is truncated to 10,000 rows.
- Variable number of write clients – A variable number of clients (16, 32, 64, 128, 256, 512, 768, and 1,000) engages in concurrent UPDATE operations, with each update followed by a brief idle period. The SQL commands are shown in the sketch after this list.
- XIDs – XIDs are consumed with a SELECT command saved to an xid-commit.sql file (also shown in the sketch after this list).
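The exact SQL from the original benchmark isn't reproduced here; the following is a rough sketch of what the three files might contain (the table name, random-key expression, and sleep duration are assumptions):

```sql
-- xid-select.sql: the single read client repeatedly scans the 10,000-row table
SELECT count(*) FROM pgbench_accounts;

-- xid-update.sql: each write client updates a random row and then idles briefly,
-- keeping many concurrent transactions (and their XIDs) in flight
UPDATE pgbench_accounts
   SET abalance = abalance + 1
 WHERE aid = (random() * 10000)::int + 1;
SELECT pg_sleep(0.01);

-- xid-commit.sql: consume transaction IDs so snapshots accumulate many XIDs
SELECT pg_current_xact_id();
```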
The main goal is to measure the transactions per second (TPS) of the read-only clients while multiple write clients are operating under controlled conditions. These write clients pause for a brief period (a few milliseconds) after each update, resulting in snapshots with a significant number of XIDs.
To run the benchmark, refer to the following script. This benchmark is designed to offer valuable performance metrics in situations where both write and read operations occur simultaneously, enabling a thorough assessment of system capabilities.
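The original script isn't reproduced here; a rough sketch of a driver (database name, run duration, and thread counts are assumptions) might look like the following:

```bash
#!/bin/bash
# For each write-client count, run the writers, the XID-consuming client, and the
# single read client concurrently, and record the read client's TPS.
for writers in 16 32 64 128 256 512 768 1000; do
    pgbench -n -c "$writers" -j 8 -T 60 -f xid-update.sql bench > /dev/null &
    pgbench -n -c 1 -j 1 -T 60 -f xid-commit.sql bench > /dev/null &
    pgbench -n -c 1 -j 1 -T 60 -f xid-select.sql bench | grep "tps"
    wait
done
```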
The following table summarizes the results (number of transactions per second) from running the script against PostgreSQL 15 and PostgreSQL 16.
| Write Clients | PG 15 | PG 16 |
|---|---|---|
| 16 | 939.0 | 959.6 |
| 32 | 860.8 | 870.8 |
| 64 | 833.4 | 800.8 |
| 128 | 693.2 | 724.2 |
| 256 | 665.9 | 650.6 |
| 512 | 468.4 | 473.6 |
| 768 | 334.1 | 396.0 |
| 1000 | 234.1 | 332.2 |
The following figure visualizes these results.
As evident from the results, a notable enhancement in TPS is observed, particularly once the number of write clients reaches 768 and beyond. Notably, with 1,000 write clients, we observe a substantial increase in TPS from 234 to 332, an improvement of approximately 42%.
Track when a table or index was last used
PostgreSQL 16 introduces a new feature to track the timestamp of the last scan of a relation, which is useful for index evaluation and monitoring relation usage over time. This timestamp is updated only when the backend pending stats entry is flushed to shared stats, enhancing PostgreSQL’s monitoring capabilities.
Additionally, PostgreSQL 16 introduces a new counter, n_tup_newpage_upd, which tracks row updates that result in a new heap page. This feature complements the existing n_tup_hot_upd and n_tup_upd columns, aiding in precise database performance tuning and identifying tables that may benefit from adjustments to their heap fill factors.
The following table summarizes the modifications to pg_stat_*_tables.

| Column Name | Data Type | Description |
|---|---|---|
| last_seq_scan | timestamp with time zone | Records the most recent access to the relevant relation by a sequential scan |
| last_idx_scan | timestamp with time zone | Documents the most recent access to the relevant relation by an index scan |
The following table summarizes the modifications to pg_stat_*_indexes.

| Column Name | Data Type | Description |
|---|---|---|
| last_idx_scan | timestamp with time zone | Captures the most recent access to the index |
These changes are particularly beneficial for monitoring the efficacy of created indexes. Imagine a scenario where you've created an index on a table, but the index is never actually utilized. This results in not only wasted index space but also the unnecessary cost of updating the index each time the table is updated.
By monitoring the last_seq_scan and last_idx_scan timestamps in pg_stat_user_tables, you can track when a sequential scan was last performed on a table or when the table was last accessed using an index.
Let's walk through an example to see these modifications in action. We start by creating a test_table table, adding some data, creating an index, and resetting the activity statistics using pg_stat_reset(). Then we can monitor the updates in pg_stat_user_tables.
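A sketch of the setup (table, column, and index names are assumptions, chosen to match the balance column used later):

```sql
CREATE TABLE test_table (
    id      bigint,
    balance numeric
);

INSERT INTO test_table (id, balance)
SELECT g, (random() * 10000)::numeric(10,2)
FROM generate_series(1, 100000) AS g;

CREATE INDEX idx_test_table_id ON test_table (id);

-- Reset activity statistics for the current database
SELECT pg_stat_reset();

-- Baseline: scan counters and the new last_*_scan timestamps
SELECT seq_scan, last_seq_scan, idx_scan, last_idx_scan
FROM pg_stat_user_tables
WHERE relname = 'test_table';
```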
Run the following query and recheck pg_stat_user_tables:
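For example, an index-based lookup against the assumed table, followed by the statistics check:

```sql
SELECT balance FROM test_table WHERE id = 100;

SELECT seq_scan, last_seq_scan, idx_scan, last_idx_scan
FROM pg_stat_user_tables
WHERE relname = 'test_table';
```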
We notice an increase in the idx_scan count and an updated last_idx_scan timestamp.
Finally, run a query to calculate the average of the balance column without a WHERE clause condition. This query invokes a sequential scan, as shown by an increase in the seq_scan count and an update to the last_seq_scan timestamp.
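A sketch of that query and the recheck:

```sql
SELECT avg(balance) FROM test_table;

SELECT seq_scan, last_seq_scan, idx_scan, last_idx_scan
FROM pg_stat_user_tables
WHERE relname = 'test_table';
```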
Monitor I/O with pg_stat_io
PostgreSQL 16 introduces the pg_stat_io view for more comprehensive I/O statistics. Previous I/O statistics, like those in pg_stat_bgwriter and pg_stat_database, were limited and didn't provide detailed information to understand the primary sources of I/O or evaluate configuration changes. The pg_stat_io view addresses these limitations by counting and aggregating I/O operations.
The view displays I/O operations for specific combinations of backend type, which refers to the type of backend handling the connection or server process (such as autovacuum worker or bgwriter), the target object of the I/O (relations, temp relations), and the context of the I/O (normal, vacuum, bulkread, bulkwrite). Each column represents the total number of a specific I/O operation. Certain combinations that never occur are omitted, and if an I/O operation never occurs for a given combination, the cell will be null. This new feature provides a more detailed, accurate, and customizable way to monitor I/O operations in PostgreSQL.
It's important to note that some of the cells in the view are redundant with fields in pg_stat_bgwriter (for example, buffers_backend). For now, these redundancies have been retained for backward compatibility.
After you initialize and start your database cluster, run the following query against the pg_stat_io view:
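A query along these lines (the column list is a readable subset of the view's columns):

```sql
SELECT backend_type, object, context, reads, writes, extends, hits, evictions
FROM pg_stat_io
ORDER BY backend_type, object, context;
```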
It returns the following output:
Focusing on the reads and writes statistics for the ‘client backend’ in PostgreSQL is important for performance monitoring and optimization. The backend processes are responsible for handling client requests, which include running SQL queries and returning results. By closely monitoring reads (data retrieval) and writes (data modification), you can gain valuable insights into how these operations impact the overall performance and efficiency of the database. This can help identify any bottlenecks or areas for improvement, enabling more efficient resource allocation and better overall database performance.
For the client backend backend_type, the reads in the normal context are not zero, likely reflecting I/O operations for system catalogs that were not already in the cache.
For this experiment, we use pgbench to create I/O. Begin with initialization at a scale factor of 10:
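For example (database name assumed):

```bash
# Initialize the pgbench tables at scale factor 10 (about 1 million rows in pgbench_accounts)
pgbench -i -s 10 postgres
```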
Check the I/O stats for the same backend_type:
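For example:

```sql
SELECT object, context, reads, writes, extends
FROM pg_stat_io
WHERE backend_type = 'client backend';
```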
For this experiment, we can observe the following details:
- During the initialization of pgbench, data loading is performed using the COPY command. This operation is counted as writes in the bulkwrite line.
- After data loading, a VACUUM operation is carried out. At this point, there are no dead tuples, so writes are not counted. However, reads during the VACUUM operation are logged and counted in the vacuum line.
- At the end of the initialization process, a primary key is set, which triggers implicit index generation. This process reads data directly, bypassing the shared buffer. As a result, these reads are counted under the bulkread line.
Let's configure pgbench to run with two connections, each running 500 transactions. This modification should introduce changes to the statistics as pgbench's vacuum step runs.
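For example (database name assumed):

```bash
pgbench -c 2 -t 500 postgres
```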
Use the pg_stat_io query as before:
The likely reason the writes count isn't increasing is that the updates are being performed entirely in shared buffers, so the dirty pages haven't been written out yet.
By regularly gathering data from this view, you can achieve efficient monitoring of I/O operations across the database cluster. This feature provides real-time insight into any increase in I/O, which can be a critical factor in database performance.
New security features
PostgreSQL 16 introduces the `SYSTEM_USER` SQL keyword. The `SYSTEM_USER` keyword returns the authenticated identity (authn_id) associated with the current database session, addressing situations where the authenticated identity differs from the `SESSION_USER`.
The `SYSTEM_USER` keyword is valuable in scenarios where the authenticated identity is not the same as the `SESSION_USER`, preventing the loss of information from the connection's perspective. This is particularly relevant in authentication methods like GSS (Generic Security Service) and peer authentication.
Let’s look at a few examples.
GSS authentication
Use the following code for pg_hba.conf:
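The original entry isn't shown; a hypothetical example (network range, map name, and options are assumptions) might look like this:

```
# pg_hba.conf: GSSAPI authentication with a user-name map
host    all    all    0.0.0.0/0    gss    include_realm=1    map=gssmap
```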
Use the following code for pg_ident.conf:
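A hypothetical mapping (realm and user names are assumptions) that maps any principal in the realm to the database user xyzuser:

```
# pg_ident.conf
# MAPNAME    SYSTEM-USERNAME            PG-USERNAME
gssmap       /^(.*)@EXAMPLE\.COM$       xyzuser
```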
Use the following code for your connection:
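For example (the principal and host name are placeholders):

```bash
kinit alice@EXAMPLE.COM
psql "host=db.example.com dbname=postgres user=xyzuser"
```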
Run the following query:
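For example:

```sql
SELECT session_user, system_user;
```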
You get the following output:
Here, the session_user is identified as xyzuser. However, to get the relevant authentication information, you can refer to the system_user column's value.
Peer authentication
Use the following code for pg_hba.conf:
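A hypothetical entry (map name assumed) for peer authentication over local connections:

```
# pg_hba.conf: peer authentication with a user-name map for local connections
local    all    all    peer    map=peermap
```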
Use the following code for pg_ident.conf:
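A hypothetical mapping (user names assumed) that lets an operating system user connect as a different database user:

```
# pg_ident.conf
# MAPNAME    SYSTEM-USERNAME    PG-USERNAME
peermap      osadmin            dbadmin
```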
Use the following code for the connection:
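For example, as the mapped operating system user, connect locally over a Unix-domain socket:

```bash
psql -U dbadmin -d postgres
```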
Run the following query:
The following code shows the output:
The patch addresses situations where information about the authenticated identity was lost from the connection's perspective, especially when it differs from the `SESSION_USER`. By introducing the `SYSTEM_USER` SQL reserved word, the authenticated identity becomes visible and can be retrieved in scenarios like GSS and peer authentication. This enhances the completeness of information available in the current session or for the `SESSION_USER`.
The keyword returns the authentication method and authenticated identity in the form auth_method:user_identifier, indicating what the identifier actually is and how it was verified. It returns NULL in the case of trust authentication.
Note that Amazon Aurora and Amazon RDS don’t let you make direct modifications to the pg_hba.conf file as this is fully managed for you.
Row-level security policy
Let's look at another example to understand how the SYSTEM_USER functionality can help you get the original user information even if a mapped user accesses a database.
Create an employees table with name, department, and salary information, with the understanding that each user should only be able to view their own information in the table. (No user should be able to view details for any other user.)
Create the employees table:
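A sketch of the table (column types assumed):

```sql
CREATE TABLE employees (
    name       text,
    department text,
    salary     numeric
);
```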
Insert some sample data:
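For example (values assumed):

```sql
INSERT INTO employees (name, department, salary) VALUES
    ('John Doe',   'Engineering', 95000),
    ('Jane Smith', 'Marketing',   87000),
    ('peterk',     'Support',     78000);
```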
Create a row-level security policy to restrict access to rows based on name:
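A sketch of a name-based policy (policy name assumed):

```sql
CREATE POLICY emp_name_policy ON employees
    FOR SELECT
    USING (name = current_user);
```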
Enable the policy on the table:
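For example:

```sql
ALTER TABLE employees ENABLE ROW LEVEL SECURITY;
```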
Grant SELECT privileges on the table to relevant users and roles:
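For example, assuming the database roles "John Doe" and peterk already exist:

```sql
GRANT SELECT ON employees TO "John Doe", peterk;
```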
Let's see an example where user peterk gets database access to another user's rows (John Doe's):
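The following sketch assumes that operating system user peterk is mapped (for example, through pg_ident.conf) to the database user "John Doe":

```sql
-- Connected as database user "John Doe", although the session was authenticated as peterk
SELECT current_user, system_user;
-- current_user is "John Doe"; system_user reflects the real authenticated identity,
-- for example peer:peterk

SELECT * FROM employees;
-- The name-based policy compares name with current_user, so John Doe's row is returned
-- even though the person behind the session is peterk
```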
As we can see, the currently logged in database user (current_user) is John Doe. However, the session actually belongs to a different operating system user (the system_user is different). Therefore, with the current row-level security policy, which is enabled on the employees table and based only on the database user name, user peterk gets access to the row that actually belongs to John Doe.
With the new system_user function, which retrieves the system user details along with the authentication method, we can create a new row-level security policy that restricts access based on the system user details instead of just the database user.
Drop the existing policy:
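For example:

```sql
DROP POLICY emp_name_policy ON employees;
```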
Create a new row-level security policy to restrict access to rows based on system_user:
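A sketch of such a policy; it assumes the employee name matches the user identifier portion of system_user, which has the form auth_method:user_identifier (for example, peer:peterk):

```sql
CREATE POLICY emp_system_user_policy ON employees
    FOR SELECT
    USING (name = split_part(system_user, ':', 2));
```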
Now, when user peterk accesses the database using the credentials of John Doe, the system_user information in the policy restricts the user from accessing unintended information:
Therefore, although the database user is John Doe, the employees table data is displayed for user peterk based on the system_user information from the security policy:
Conclusion
PostgreSQL 16 continues to improve on what’s possible with database management, introducing significant enhancements and features that boost performance and monitoring capabilities. The release brings forward vector operations for XID and sub-XID searches, a new system view for tracking I/O statistics, and significant improvements in logical replication and SQL/JSON constructors. These offerings make PostgreSQL 16 a compelling choice for managing complex data workloads while providing robust security.
You can use these advanced features with PostgreSQL 16 on Amazon Aurora and Amazon RDS, and experience the potential for even more innovation in future releases. Stay connected with the community mailing lists to keep up to date on the latest developments in one of the most powerful and reliable database management systems available.
About the authors
Jonathan Katz is a Principal Product Manager – Technical on the Amazon RDS team and is based in New York. He is a Core Team member of the open-source PostgreSQL project and an active open-source contributor.
Rajesh Madiwale is a Lead Consultant with Amazon Web Services. He has deep expertise on database development and administration on Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL, Amazon Redshift, MySQL, and Greenplum databases. He is an ardent member of the PostgreSQL community and has been working on PostgreSQL his entire tenure. He has also delivered several sessions at PostgreSQL conferences.
Baji Shaik is a Sr. Database Consultant with AWS Professional Services. He is a Database Migration Expert and has developed many successful database solutions addressing challenging business requirements for moving databases from on-premises Oracle and SQL Server to Amazon RDS and Amazon Aurora PostgreSQL/MySQL. He is an eminent author, having written several books on PostgreSQL. A few of his recent works include “Procedural Programming with PostgreSQL PL/pgSQL,” “PostgreSQL Configuration,” “Beginning PostgreSQL on the Cloud,” and “PostgreSQL Development Essentials.” Furthermore, he has delivered several conference and workshop sessions.
Swanand Kshirsagar is a Lead Consultant within the Professional Services division at Amazon Web Services. He specializes in collaborating with clients to architect and implement scalable, robust, and security-compliant solutions within the AWS Cloud environment. His primary expertise lies in orchestrating seamless migrations, encompassing both homogenous and heterogeneous transitions, facilitating the relocation of on-premises databases to Amazon RDS and Amazon Aurora PostgreSQL with efficiency.