AWS Database Blog

PostgreSQL 12 – A deep dive into some new functionality

The PostgreSQL community continues its consistent cadence of yearly major releases with PostgreSQL 12. PostgreSQL 12 introduces features opening up new development opportunities while simplifying the management of some already robust capabilities, such as new ways to query JSON data, enhancements to indexes, and better performance with partitioned tables.

In this post, I take a close look some of the exciting new features in PostgreSQL 12. I explore how they can be incorporated into existing development and operational practices. Some of these features are transparent and simply upgrading to PostgreSQL 12 can leverage them. Others require changes to existing applications or processes to use them. I cover the benefits of these new features and show examples of how to adapt these features to existing applications.

Indexing improvements

In PostgreSQL, the default index type is a B-tree. B-tree indexes are used to index most types of data from integers that are primary keys to strings that are email addresses. As tables grow, so do the corresponding indexes. When B-tree indexes grow, the data structures need to stay balanced so that when a particular leaf page becomes full, the page must be split. In most cases, PostgreSQL splits those pages in the middle so that each newly split page has an equal amount of data and free space. If the data being added to the table is somewhat random, splitting in the middle is ideal. However, if the data has many duplicate index entries, splitting in the middle leaves a large amount of free space potentially unused. PostgreSQL 12 changes the logic of splitting B-tree index pages to use the context of duplicate index entries and uses some compression techniques. Depending on the data in the indexes, these improvements can result in some B-tree indexes being as much as 40% smaller on PostgreSQL 12.

Databases upgraded from an older version of PostgreSQL still have the older B-tree format. To take advantage of the new B-tree format, indexes must be created on PostgreSQL 12. PostgreSQL has a REINDEX command that rebuilds all indexes on a given table, but the REINDEX command takes a prohibitive lock in many production environments.

postgres=# REINDEX TABLE events;

postgres=# SELECT locktype, transactionid, mode, granted
             FROM pg_locks WHERE relation = 'events'::regclass;
 locktype | transactionid |   mode    | granted
----------+---------------+-----------+---------
 relation |               | ShareLock | t
(1 row)

The REINDEX command takes a ShareLock on the table preventing all writes to the table. PostgreSQL 12 introduces a new command, REINDEX CONCURRENTLY, which rebuilds indexes without the heavier lock taken by REINDEX.

postgres=# REINDEX TABLE CONCURRENTLY events; 

postgres=# SELECT locktype, transactionid, mode, granted
             FROM pg_locks WHERE relation = 'events'::regclass;
locktype | transactionid |           mode           | granted
----------+---------------+--------------------------+---------
 relation |               | ShareUpdateExclusiveLock | t
(1 row)

The new REINDEX CONCURRENTLY command takes a ShareUpdateExclusiveLock, which only prevents concurrent changes to the table structure. The reduced locking of REINDEX CONCURRENTLY comes as the price of longer index build times, but reads and writes are still allowed on the table keeping it online during the operation.

While performing maintenance operations like rebuilding indexes, knowledge of the progress of the operation is often required. Index builds are resource intensive commands consuming a high amount of CPU resources. With limited maintenance windows, administrators frequently must make decisions to allow an operation to continue or stop it. PostgreSQL 12 introduces a new system view, pg_stat_progress_create_index. This view shows the overall progress of an index creation. This is useful for estimating when a command will complete.

postgres=# SELECT command, phase, blocks_total, blocks_done FROM pg_stat_progress_create_index;
       command        |             phase              | blocks_total | blocks_done
----------------------+--------------------------------+--------------+-------------
 REINDEX CONCURRENTLY | building index: scanning table |      1639345 |      220340
(1 row)

In addition to the pg_stat_progress_create_index view, PostgreSQL 12 also introduces the pg_stat_progress_cluster view that tracks the progress of a CLUSTER command.

New JSON query language

Native JSON support was first added to PostgreSQL 9.2 in 2012 with the introduction of a JSON datatype. As one of the first relational databases to support JSON, PostgreSQL created many functions and operators to interact with JSON documents through SQL. Since then, the International Organization for Standardization (ISO) has released a standard manner to interact with JSON inside of a database, the SQL/JSON path language. PostgreSQL 12 adds support for the SQL/JSON path language allowing developers the choice of how to query their JSON documents.

To demonstrate the differences in methods of querying JSON documents, consider a document containing information about an event that has several keys including a date and a language. One of the documents may look like this.

{
  "date": ["2000/10/01"],
  "lang": ["en"],
  "category1": ["October"],
  "description": ["The 2000 Summer Olympics close in Sydney, Australia."],
  "granularity": ["year"]
}

Many of these documents can be stored in a simple events table like this.

CREATE TABLE events (
  id    uuid,
  event jsonb
);

Before PostgreSQL 12, to search for all of the events with a date during the year 2000 in English, we could use a query like this.

SELECT * 
  FROM events 
WHERE event->'date'->0 >= '"2000/01/01"' 
  AND event->'date'->0 < '"2001/01/01"'
  AND event->'lang'->0 = '"en"';

This query extracts the values for the keys of date and lang and then uses SQL to determine if the values in the document satisfy the conditions of the query. This functions well but from a performance stand point, values are extracted three times, once for each criteria in the WHERE clause.

With the introduction of the SQL/JSON path language, a new jsonpath type is available. The jsonpath type is a binary representation of a SQL/JSON path expression that can be used to efficiently query a complex JSON document. The SQL/JSON path expression that matches the query to return all of the events during the year 2000 in English looks like this.

'$.date &gt;= "2000/01/01" &amp;&amp; $.date &lt; "2001/01/01" &amp;&amp; $.lang == "en"'

The SQL/JSON path language has a flexible syntax allowing JSON documents to be queried in powerful ways. Expressions can be constructed to recursively match any key or target a specific key at a specific level. The full syntax of the support SQL/JSON path language is available in the PostgreSQL documentation.

PostgreSQL 12 introduces two new built-in functions to compare a jsonpath expression to the JSON documents in our database. The jsonb_path_match function returns true if the document matches the entire expression. The other function, jsonb_path_exists, returns true if the document matches any piece of the expression. For our query, we constructed an expression that must fully match the document so the function jsonb_path_match is needed for the query. To query the events table using the jsonb_path_match function, the query looks like this:

SELECT *
  FROM events 
 WHERE jsonb_path_match(event, 
'$.date >= "2000/01/01" && $.date < "2001/01/01" && $.lang == "en"'); 

Like most features with PostgreSQL, the new jsonpath functions have corresponding operators to simplify queries. The operator corresponding to the jsonb_path_match function is “@@” and the operator for jsonb_path_exists is “@?” so using the operator, the query using jsonb_path_match can look like this:

SELECT *
  FROM events 
 WHERE event @@ '$.date >= "2000/01/01" && $.date < "2001/01/01" && $.lang == "en"';

Using the jsonpath operators over the functions allows for potential future performance enhancements. As different index types support the new SQL/JSON path language, they leverage the jsonpath operators. This future proofs queries using the new syntax.

Generated columns

When working with complex data such as JSON documents, your queries may not perform at the levels needed by your application, even when the more advanced indexing methods are applied. Typically, this type of data has keys that are frequently used by most if not all queries. A technique to optimize this type of workload is to pull these keys out of the larger document and store them in dedicated columns. Before PostgreSQL 12, the best option for doing this would be to use a trigger to populate the column from the complex data. The downside of triggers is the performance overhead they introduce for all insert and update operations on the table.

PostgreSQL 12 introduces generated columns allowing a column to be computed from the value of another column. Generated columns are useful for frequently calculated values from a table. They can be simple mathematical expressions or reference complex user-defined functions. There is some performance overhead for insert and update operations but significantly less than using a trigger.

For an example of when generated columns are useful, consider our events table this is frequently queried by date. PostgreSQL has powerful date arithmetic functionality that allows you to easily target events in a specified date range such as the last 30 days. To use that functionality, the date information in the JSON document must be pulled out and converted to an actual PostgreSQL date. This can be done using a user defined function like this.

CREATE FUNCTION get_date(p_event jsonb)
  RETURNS date AS
$$
  SELECT to_date(
          jsonb_path_query(
              p_event, 
              '$.date[0]')::text, '\"YYYY/MM/DD\"');
$$ LANGUAGE sql 
IMMUTABLE;

That lets us run a query like this.

SELECT * 
  FROM events 
 WHERE get_date(event) >= current_date - 30;

This allows us to use the PostgreSQL date functionality, but we pay a performance penalty by needing to extract out the date information for every row in the table every time we run the query. Generated columns let us pay the performance penalty only once while writing the row letting the queries perform significantly faster. We can add the generated column like this.

ALTER TABLE events 
  ADD COLUMN event_date date 
  GENERATED ALWAYS AS (get_date(event)) STORED; 

Now, the query becomes just simply this.

SELECT * 
  FROM events 
 WHERE event_date >= current_date - 30;

With the event data as a generated column, it can be used in most cases like regular column. Indexes and constraints can be added to the column based on application needs. However, there are a few limitations such as generated columns cannot be used as partition keys.

Partitioning improvements

PostgreSQL 10 introduced declarative partitioning allowing large tables to be split into smaller, more manageable pieces. PostgreSQL 11 improved declarative partitioning by adding hash partitioning, primary key support, foreign key support, and partition pruning at execution time. PostgreSQL 12 continues to add to the partitioning functionality. The most noticeable enhancement is a performance improvement when running queries against a partitioned table. For some applications, a large number of partitions may be needed. Some application may need daily partitions going back seven years or others may need a partition for each customer resulting in thousands of partitions. Before PostgreSQL 12, the algorithm to eliminate partitions needed by a query was inefficient.

To demonstrate this inefficient algorithm, consider a simple transactions table that has 5000 partitions that is range partitioned by the primary key, trans_id.

CREATE TABLE transactions (
  trans_id bigserial PRIMARY KEY,
  trans_date date,
  amount numeric,
  status int
) PARTITION BY RANGE (trans_id);

When running a query against the primary key utilizes an index, but on PostgreSQL 11, the query planning time against the “transactions” table is prohibitively long. The execution plan for a simple query is here.

postgres=# EXPLAIN ANALYZE SELECT *
postgres-# FROM transactions WHERE trans_id = 96781536;
                                 QUERY PLAN
---------------------------------------------------------------------
 Append  (cost=0.15..8.18 rows=1 width=48) 
         (actual time=0.011..0.012 rows=0 loops=1)
   ->  Index Scan using transactions_96800000_pkey 
       on transactions_96800000
         (cost=0.15..8.17 rows=1 width=48) 
         (actual time=0.009..0.010 rows=0 loops=1)
         Index Cond: (trans_id = 96781536)
 Planning Time: 1758.868 ms
 Execution Time: 0.261 ms
(5 rows)

The execution time for this query is only 0.261 ms, but the planning time to determine the execution path is more than 1.7 seconds. This long planning time is proportional to the number of partitions, which makes having a large number of partitions in older version of PostgreSQL typically not feasible. PostgreSQL 12 addresses this by changing the planning algorithm to be significantly more efficient for a large number of partitions. The execution plan for the same query in PostgreSQL 12 is here.

postgres=# EXPLAIN ANALYZE SELECT *
postgres-# FROM transactions WHERE trans_id = 96781536;
                                 QUERY PLAN
---------------------------------------------------------------------
 Index Scan using transactions_96800000_pkey on transactions_96800000   
   (cost=0.15..8.17 rows=1 width=48) 
   (actual time=0.010..0.011 rows=0 loops=1)
   Index Cond: (trans_id = 96781536)
 Planning Time: 0.112 ms
 Execution Time: 0.036 ms
(4 rows)

The planning time for this query has reduced to 0.112 ms and the execution time has also reduced to 0.036 ms. This results in more than a 10,000 times performance improvement.

With the partitioning performance improvements in PostgreSQL 12, adding partitions can become a lot more common. Before PostgreSQL 12, attaching a new partition to an existing table required the entire table to be fully locked preventing all reads and writes. Many applications cannot afford a momentary outage that this causes.

This stored procedure is an example that adds a partition to the “transactions” table.

CREATE PROCEDURE add_partition(p_from_id bigint, p_to_id bigint)
AS $$
DECLARE
  sql text;
  l_name text;
BEGIN
  IF does_partition_exist(p_from_id, p_to_id) THEN
    RAISE EXCEPTION 'Partition range already exists for % to %', 
                    p_from_id, p_to_id;
  END IF;

  l_name := 'transactions_' || p_to_id;
  sql := 'CREATE TABLE ' || l_name ||
         ' (LIKE transactions INCLUDING DEFAULTS INCLUDING CONSTRAINTS)';
  EXECUTE sql;

  sql := 'ALTER TABLE transactions ATTACH PARTITION ' || l_name ||
         ' FOR VALUES FROM ' ||
         '(' || p_from_id || ') TO (' || p_to_id || ')';
  EXECUTE sql;
END
$$ LANGUAGE plpgsql;

When running this stored procedure on PostgreSQL 11, there are two locks that can be seen in the pg_locks system view.

postgres=# select locktype, mode, granted 
           from pg_locks where relation = 'transactions'::regclass;
 locktype |        mode         | granted
----------+---------------------+---------
 relation | AccessShareLock     | t
 relation | AccessExclusiveLock | t
(2 rows)

The first is an AccessShareLock. This lock is common and occurs when reading the table and it prevents changes to the table structure while the transaction is in progress. The second lock is an AccessExclusiveLock, which is a very heavy lock. This lock prevents all access to the table including reads until the transaction is complete and the lock is released.

When running the same stored procedure on PostgreSQL 12, the acquired locks are different.

postgres=# SELECT locktype, mode, granted 
             FROM pg_locks WHERE relation = 'transactions'::regclass;
 locktype |           mode           | granted
----------+--------------------------+---------
 relation | AccessShareLock          | t
 relation | ShareUpdateExclusiveLock | t
(2 rows)

The same AccessShareLock is acquired, but instead of an AccessExclusiveLock, a ShareUpdateExclusiveLock is taken. A ShareUpdateExlusiveLock is lighter weight than an AccessExclusiveLock allowing both reads and writes while the lock is held. It essentially just prevents concurrent changes to the table structure. This reduced lock strength allows new partitions to be added to a table without downtime.

With the increased functionality of partitioning in PostgreSQL, more tables are leveraging it. Before PostgreSQL 12, a partitioned table was not able to be a reference for a foreign key constraint limiting the referential integrity of an application. PostgreSQL 12 introduces the ability for foreign key constraints to reference them.

ALTER TABLE transaction_lines 
  ADD CONSTRAINT trans_fk FOREIGN KEY (trans_id) 
  REFERENCES transactions (trans_id);

Configuration parameters

Every major version, PostgreSQL introduces new configuration parameters and at times, changes the defaults for existing parameters. Many of these new parameters are knobs controlling new PostgreSQL. The Amazon Relational Database Service (RDS) PostgreSQL 12 default parameter group sets the new configurations to values that work well for most applications, but some fine tuning may be needed based on workload or business rules. When creating an Amazon RDS PostgreSQL 12 database instance, pay particular attention to the following parameters as they are either new or differ from the PostgreSQL community default values.

ssl_min_protocol_version = TLSv1.2

This new parameter sets the minimum SSL/TLS protocol version allowed to connect to the database server. The default of the PostgreSQL community is TLSv1, but Amazon RDS PostgreSQL defaults to a higher level of security.

jit = off

Just-in-Time (JIT) compilation is the ability to turn parts of a complex SQL expression into a native runtime allowing for faster execution of some queries. This feature was first introduced in PostgreSQL 11 and the default of the PostgreSQL community turns this feature on. This feature can cause a performance regression for some use cases so the default of Amazon RDS PostgreSQL 12 is to turn this feature off.

log_transaction_sample_rate = 0

This new feature allows for a portion of all transactions to be written to the log file. Some applications have an extremely high transaction rate that turning on logging for all statements would overwhelm the logs. This parameter can be set to adjust the fraction of transactions written to the logs. For example, setting this parameter to 0.01 writes 1% of the transactions to the logs.

tcp_user_timeout = 0

Some environments have many stale connections due to unreliable network connectivity. PostgreSQL 12 introduces a new feature addressing these situations controlled by this parameter. It specifies the time in milliseconds that a connection waits for acknowledgement before terminating the connection.

plan_cache_mode = auto

When PostgreSQL runs a prepared statement, it can either use a custom plan based on the specific parameters used or a generic plan, which saves planning time. PostgreSQL has always automatically chosen between those options based on the number of times a prepared statement has been executed. PostgreSQL 12 allows that behavior to be overridden with this parameter.

extra_float_digits = 0

This parameter has been available in PostgreSQL for a number of years, but the behavior has changed in PostgreSQL 12. It adjusts the number of digits used for the textual output of floating-point values. The default of the PostgreSQL community sets this value to 1, which outputs the value in the shortest precise format. This new behavior has a performance advantage over the old default of 0, which rounded the result. Amazon RDS PostgreSQL 12 retains 0 as the default keeping the output of floating-points consistent across versions.

hll.force_groupagg = false

Some extensions have parameters that change across versions. The extension, postgresql-hll, creates a HyperLogLog datatype and corresponding functions to use that datatype. Storing data in an hll structure can increase performance for some analytical use cases. This parameter disables the use of hash aggregation and forces group aggregation for queries using the hll datatype. Forcing group aggregation can increase the performance of some queries.

Summary

In this blog post, I explored many of the great new features in PostgreSQL 12. Amazon RDS for PostgreSQL 12 is now available allowing access to the PostgreSQL community’s many impressive features, plus all the benefits of Amazon RDS. Amazon RDS PostgreSQL provides managed database capabilities like automatic failover, backup and recovery, and automated patching. There are also more than 60 extensions available adding the already impressive capabilities of PostgreSQL.

With all the great new features in PostgreSQL 12, the largest new development is beneath the surface waiting for future versions of PostgreSQL. A new internal architecture creating an API for Table Access Methods lays the ground work for next generation work. Table Access Methods allows PostgreSQL to support new storage formats such as columnar. Expect to see many new features in future versions of PostgreSQL as it continues to innovate.

 


About the Author

Jim is a Principal Database Engineer at AWS. Prior to joining AWS, Jim was CTO at OpenSCG and CEO at StormDB, a Postgres-XC scalable public cloud solution. Jim was chief architect and part of the founding team of EnterpriseDB. Jim also very active in the PostgreSQL community, co-founding both the New York and Philadelphia PostgreSQL Meetup groups, co-organizing the largest PostgreSQL conferences, and serving on the Board of the PostgreSQL Foundation. Jim also speaks regularly at PostgreSQL and other open sources conferences.