AWS Database Blog
Dive into new functionality for PostgreSQL 11
The initial PostgreSQL project started as a university project in 1986. The open source community took over the PostgreSQL project in 1996, and has consistently released a major version every year. Given the software’s complexity, this rapid release schedule has required the segmentation of major features into smaller, more basic elements. These smaller, foundational features combine to make up every major release of PostgreSQL, including PostgreSQL 11. PostgreSQL 11 is supported by both Amazon RDS for PostgreSQL and Amazon Aurora with PostgreSQL compatibility.
In this post, I take a close look at three exciting features in PostgreSQL 11: partitioning, parallelism, and just-in-time (JIT) compilation. I explore the evolution of these features across multiple PostgreSQL versions. I also cover the benefits that PostgreSQL 11 offers, and show practical examples to point out how to adapt these features to your applications.
Partitioning
As databases grow, a few tables typically are sources of the majority of the growth. Growth may center in tables that keep a historical log of all activity or in your user table.
You can deal with the high database growth through table partitioning, which means splitting a single large table into smaller, more manageable chunks (partitions) to speed up queries on those large tables. Performance improves when the database can prune away whole partitions during query execution, processing much less data. The concept of table partitioning isn’t new in PostgreSQL 11. PostgreSQL first introduced a form of table partitioning in version 8.1, released in 2005.
For example, consider the following orders
table. An application adds a new row to this table for every sales order. As more sales occur, this table gets larger by the day. Tables that grow over time like this are prime candidates for time-based partitioning:
In version 8.1 through 9.6 of PostgreSQL, you set up partitioning using a unique feature called “table inheritance.” That is, you set up yearly partitions by creating child tables that each inherit from the parent with a table constraint to enforce the data range contained in that child table.
The PostgreSQL optimizer then uses a feature called constraint exclusion, which uses constraints to prune away partitions at query time and increase query performance:
Although constraint exclusion is effective, the method has a downside for the child inheritance model. Specifically, constraint exclusion complicates the loading data into the partitioned tables. As a result, a trigger is needed on the parent table to move rows into the appropriate child:
PostgreSQL 10 offers a declarative partitioning syntax. You define the partitioning method and partitioning key. PostgreSQL 10 supports the range and list partitioning methods.
In addition to the clearer partitioning syntax, PostgreSQL 10 eliminates the need for triggers on parent tables. Without triggers, you can partition tables without code, and partitioning syntax makes the table easier to maintain. PostgreSQL now automatically routes rows to the correct partition, improving both data loading and maintenance.
As the name implies, range partitioning handles a range of values. For example, in time series data, a date range may be used to create daily, monthly, or yearly partitions. This is a perfect fit for our orders
table:
Use list partitioning for cases with a limited number of fixed values. You define each partition’s list of values during partition creation. Here, I use list partitioning for the customer
table, partitioning by Region with a list of nations belonging to each Region:
The right choice of partitioning method and partitioning key depends on the application. Your choice could even evolve as your data changes. You may need some range partitioning and some list partitioning in the same schema.
PostgreSQL 11 extends this syntax further, enabling you to define a default partition. A default partition helps if you obtain data with values outside of established partition ranges. For example, in the orders
table, I have yearly partitions going back to 2015. But I could add a row with an empty order date or, for example, January 1, 1970. Rows with values outside the established partition range default to this row instead of causing an error:
Index creation on partitioned tables is also simpler in PostgreSQL 11. In PostgreSQL 10, the only way to create local indexes on all the partitions is to create them individually on each partition. Attempting to create indexes on the main parent table returns an error.
For example, in the following statement, I try to create an index on the orders
table in PostgreSQL 10. The attempt predictably returns an error. I then create the index on the partition:
By contrast, when I attempt to create an index on the orders
table in PostgreSQL 11, the statement succeeds in building local indexes on all the partitions:
PostgreSQL 11 also introduces a hash partitioning method that adds to the range and list methods introduced in PostgreSQL 10. Hash partitioning is useful for large tables containing no logical or natural value ranges to partition.
In my sales database, the part
table offers a perfect candidate for hash partitioning. Although the table is designed to grow over time, time values do not determine the table rows. Older data may be referenced just as frequently as new data. This table represents an ideal use case for hash partitioning on the table’s primary key:
Now I only need to decide how many partitions to create. The ideal number of partitions depends on the application, but if I wanted to have 10 partitions, I would define the partition with a modulus of 10:
PostgreSQL 11 provides even further enhancements in partition pruning, which means removing the need to scan whole partitions for a query. PostgreSQL 10 and earlier versions prune partitions away while a query is being planned. If provided with enough information to rule out a value range, PostgreSQL eliminates the corresponding partition from the scan. The needed information is typically a WHERE
condition that contains a constant. This method works for simple queries.
Of course, you typically access large partitioned tables by joining them to other smaller tables. But PostgreSQL 11 allows you to prune away partitions at execution time. The result of a join
removes the scans of unneeded partitions. The following query shows examples of both types of partition pruning:
The query joins together two tables that are both hash-partitioned. The WHERE
condition of the query "WHERE lineitem.l_orderkey = 215528935"
allows PostgreSQL to prune away all but a single partition during query planning.
You can see this in the following PostgreSQL query plan, in the line starting with "Index Scan using lineitem_p78_pkey on public.lineitem_p78."
The query plan never considers the other 99 partitions of line item:
The scan of the “part” table behaves differently. Multiple lines start with “Index Scan using part_pN_pkey
.” The partkey
table’s values are unknown during query planning. So PostgreSQL must consider scanning all the “part” table’s partitions. After PostgreSQL scans the lineitem
table, PostgreSQL knows the values of partkey
. PostgreSQL can then prune away the unnecessary partitions at execution time.
The query plan reflects this in the “Index Scan” lines containing the text “(never executed).” PostgreSQL can prune away five of the 10 partitions based on the rows returned from lineitem
.
Parallelism
Older versions of PostgreSQL use a single CPU core for the whole execution of each query. PostgreSQL started moving away from reliance on one CPU per query in version 9.6, which first introduced parallelism.
Parallelism allows PostgreSQL to execute a query using multiple CPUs. To accomplish this, background workers share the query execution load and hand back the intermediary results to a leader process that combines and sends everything to the client. As a result, PostgreSQL can now run queries faster.
For example, look at a simple query that is standard for many businesses: a monthly sales average. The following query calculates the average sale price by month for the first quarter of 2018 against the orders
table defined earlier:
PostgreSQL sequentially scans the entire 2018 partition of the orders
table and then calculates the average of millions of values. Consequently, the results take more than 13 seconds to return.
With PostgreSQL 9.6 adding parallel sequential scans and parallel aggregates, you can execute the monthly sales average query using multiple CPUs. By turning up the parallelism to 2, you get the results more than twice as fast:
PostgreSQL 10 enables you to run additional query execution operations in parallel. Parallel index scans make up the most common case, though some operations include parallel merge joins and parallel execution of stored functions. As tables grow larger, so do their indexes. When running queries against large indexes, parallel scanning help you obtain a faster result.
In the following query, I narrow the average sales price query down to a business week, which uses the index created earlier:
Using the index, the query obtains results in 1.5 seconds. By turning up the parallelism to 2, I obtain the results more than twice as fast:
Improving upon these established features, PostgreSQL 11 also introduces the ability to run maintenance operations in parallel. For example, PostgreSQL 11 enables the creation of indexes in parallel. Rapid execution of maintenance operations is critical to end users’ performance. Consequently, these operations typically run during a maintenance window.
In the following statement, I create indexes for all the partitions of the customer
table:
The operation completes in a little over 20 seconds. By turning up the parallelism for maintenance commands to 4 for this session, this index is created almost three times as fast:
JIT compilation
One new feature offered with PostgreSQL 11 is just-in-time (JIT) compilation. JIT compiles portions of a SQL expression into a native program that the CPU can execute quickly. Generating these programs incurs a high cost. On some queries, JIT program generation and query optimization can take over a second. After the query is optimized, it can run faster than the interpreted SQL expressions.
Typically, queries that perform calculations on many rows display faster query times. In the following query, I calculate the total sales, average sales price, minimum sale, and maximum sale by sales status over three years:
This query returns results from the three aggregated rows in about 51 seconds. After I turn on JIT, the query returns results on the same three rows in about 35 seconds:
By enabling JIT, I can improve this query’s performance by 30%. For those queries taking tens of seconds or longer to run, JIT can deliver a significant performance gain. For most common queries, however, enabling JIT slows them down to the extent of disabling the feature by default. This feature shows promise and the infrastructure is in place to improve the features and functionality of JIT in future versions of PostgreSQL.
Summary
In this post, I explored just three of the major improvements in PostgreSQL 11: partitioning, parallelism, and JIT compilation. PostgreSQL 11 also brings you transaction control in stored procedures, covering indexes, and several performance improvements. For a full list of features in PostgreSQL 11, check out the documentation at PostgreSQL Version 11.1 on Amazon RDS.
Combine PostgreSQL 11 with Amazon Aurora with PostgreSQL compatibility to gain access to the PostgreSQL community’s many impressive features, plus all the benefits of Aurora. Aurora provides managed database capabilities like automatic failover, backup and recovery, automated patching, and push-button scaling. Aurora storage offers improved performance and durability, and this service also includes more advanced features such as fast database cloning, query plan management, and cluster cache management.
Looking forward to PostgreSQL 12, you can expect to see more usability improvements on partitioning, the start of the SQL/JSON path implementation, and much more. Although the PostgreSQL database is more than 30 years old, the project’s rate of innovation continues to accelerate.
About the Author
Jim is a Principal Database Engineer at AWS. Prior to joining AWS, Jim was CTO at OpenSCG and CEO at StormDB, a Postgres-XC scalable public cloud solution. Jim was chief architect and part of the founding team of EnterpriseDB. Jim also very active in the PostgreSQL community, co-founding both the New York and Philadelphia PostgreSQL Meetup groups, co-organizing the largest PostgreSQL conferences, and serving on the Board of the PostgreSQL Foundation. Jim also speaks regularly at PostgreSQL and other open sources conferences.