Build hypothetical indexes in Amazon RDS for PostgreSQL with HypoPG
Indexes in PostgreSQL are essential for improving the performance of database queries. They serve as data structures that organize and optimize the retrieval of information from database tables. By creating indexes on specific columns, PostgreSQL can locate and access relevant data more efficiently. Indexes work by creating a separate data structure that contains a sorted copy of the indexed column’s values along with a reference to the original table’s rows. This enables PostgreSQL to perform quick lookups, filtering, and sorting operations. With indexes, the database engine can significantly reduce the number of disk reads required to satisfy a query, leading to faster query runtimes and improved overall database performance. Moreover, indexes enable PostgreSQL to run complex queries more efficiently, especially when involving joins or aggregations, because they can narrow down the search space and reduce the amount of data that needs to be processed.
In many cases, the most difficult question to answer is: do I really need to add an index to this table or column? It’s this question that HypoPG assists in solving. In this post, we discuss using HypoPG with Amazon Relational Database Service (Amazon RDS) for PostgreSQL.
The HypoPG PostgreSQL extension addresses two primary problems related to query optimization and testing in PostgreSQL:
- Hypothetical indexes – One of the main issues faced by developers and database administrators is determining the potential impact of creating or dropping an index on query performance without actually modifying the database schema. HypoPG allows the creation of hypothetical indexes that exist only during the run of a single session. These indexes can’t be used for running queries that return actual production data. Instead, it allows developers to test different index configurations without any additional CPU, disk, or memory impact to the database. This enables developers to evaluate the impact of potential index changes and make informed decisions.
- Statement-level statistics – PostgreSQL collects and maintains statistics at the table and index level, which are used by the query optimizer to estimate the cost of different query plans. However, it lacks built-in support for collecting statistics at the statement level, making it challenging to analyze the performance of individual SQL statements within a function, stored procedure, or transaction. The HypoPG extension addresses this limitation by providing statement-level statistics, allowing developers to understand the performance characteristics of specific SQL statements and identify areas for optimization.
By solving these problems, the HypoPG PostgreSQL extension empowers developers and administrators to make informed decisions regarding index creation and modification, as well as gain deeper insights into the performance of individual SQL statements within their database environment.
To start using HypoPG in Amazon RDS for PostgreSQL, you must complete the following high-level steps:
- Create a new schema.
- Enable the HypoPG extension.
- Verify the installation.
In this post, we assume that you have an RDS for PostgreSQL instance provisioned, with appropriate access rights configured such that it can be reached by PSQL.
Configure HypoPG in Amazon RDS for PostgreSQL
Complete the following steps to get started using HypoPG:
- In your PostgreSQL client, create a new schema where you’ll install the HypoPG extension. Run the following SQL command:
- In your PostgreSQL client, run the following commands to install the HypoPG extension:
- To confirm that the extension is installed successfully, run the following command:
If the installation was successful, you should see the version number of the installed HypoPG extension (in addition to version information for other installed PostgreSQL extensions).
Create hypothetical indexes in your queries using HypoPG
You can now start using HypoPG to create and test hypothetical indexes in your queries. To create a hypothetical index, use the
First, let’s create a sample table and insert some data into it:
Now we can create an explain plan, which will describe how the PostgreSQL query planner intends to run this query:
Using HypoPG, we create a hypothetical index on the
location(id) column to check if this index will be useful to the query planner. Then we can rerun the explain plan. Because HypoPG is only accessible in session where specifically enabled, ensure all parts of this tutorial are run in the same PostgreSQL session.
The query plan changed its run such that what was a sequential scan is now an index scan when using the newly created virtual index. Because this change will enable an improvement in performance to this query, we now create a PostgreSQL index based on the feedback gained using HypoPG:
Remember that hypothetical indexes are only active within the session, so they won’t affect the actual database schema. You can test different index configurations and evaluate their impact on query performance.
When to use HypoPG
Imagine you have a complex database with multiple tables, and you need to optimize a specific query that involves joining several tables and applying various filtering conditions. The query is critical for your application, and you want to improve its performance by creating appropriate indexes.
When using HypoPG, we strongly recommend not to use this extension on production workloads directly. The addition of hypothetical indexes can add additional compute overhead to your database, which may cause your workload to behave in unexpected patterns (related to resource utilization). One simple way to test using HypoPG is to create and restore a snapshot of your production database into a lower-level environment like dev or test.
In this scenario, HypoPG can be incredibly useful. With HypoPG, you can create hypothetical indexes on the relevant columns without actually modifying the database schema. You can then determine whether an index as specified in the hypothetical index could be used to optimize the query, without risking any adverse effects on the production system.
For instance, you can create multiple hypothetical indexes on different combinations of columns and run the query using each index. By comparing the runtimes and query plans, you can determine which index configuration is likely to yield the best performance improvement for your specific query.
Creating indexes on large tables can take a long time (often measured in hours, depending on the workload). HypoPG allows for the creation of hypothetical indexes much more quickly, allowing for faster iteration when searching for ideal index additions to a given table.
HypoPG also enables you to assess the potential benefits of different index types, such as B-tree, hash, or GiST indexes, without the need to create them physically. This flexibility allows you to experiment with various indexing strategies and choose the most suitable one for your query without the impact usually associated with creating indexes against tables in PostgreSQL (especially very large tables).
With HypoPG, you can make informed decisions about index creation and modification, leading to optimized query performance without the need for extensive trial and error on a live (production) database. HypoPG empowers you to fine-tune your indexes based on accurate performance analysis, ensuring that your critical queries run efficiently while maintaining the integrity of your production workloads.
When not to use HypoPG
Although HypoPG is a useful extension for testing and optimizing index configurations in PostgreSQL, there are certain scenarios where using HypoPG may not be ideal.
For example, consider a situation where you have a large production database with a high volume of concurrent transactions and frequent updates. In such cases, the overhead introduced by HypoPG’s hypothetical indexes can negatively impact the overall performance of the database.
Because HypoPG maintains additional metadata and performs extra computations to support hypothetical indexes, it can introduce increased storage requirements and additional processing overhead during query runs. If the database is already under significant load, adding hypothetical indexes through HypoPG might exacerbate performance issues rather than improve them.
Additionally, using HypoPG requires careful consideration and testing to accurately simulate the impact of hypothetical indexes on real-world query workloads. In complex scenarios involving multiple tables, joins, and aggregations, the assumptions made by HypoPG’s hypothetical indexes may not accurately reflect the behavior of the actual indexes in the production environment. This can lead to misleading results and suboptimal decision-making regarding index creation or modification.
In such cases, it is crucial to thoroughly test and evaluate the performance implications of any index changes directly on a non-production environment that closely resembles the production setup.
For example, PostgreSQL employs a mechanism called hot updates in order to modify data within a table without requiring a full rewrite of the entire row. This mechanism is applied when a row has enough free space to accommodate the updated values, and it helps to minimize overhead usually associated with updating data. When adding an index, hot updates are converted into regular updates (which rewrites the entire row). This means that adding indexes will almost always increase the write overhead required when making updates to tables where indexes exist. The more indexes added, the higher the impact that can be expected. Because of the associated impact to write overhead, use HypoPG to ensure you only add indexes where required, and eliminate any unused or unneeded indexes in order to remove write overhead from tables where indexes are deployed.
By using hypothetical indexes (via HypoPG) to test whether the addition of indexes can speed up query runtime, we can avoid the unnecessary overhead and time associated with creating real indexes on tables. This allows us to iterate and troubleshoot query run performance faster than we could otherwise.
HypoPG best practices
Consider the following best practices:
- Understand the purpose – Familiarize yourself with the purpose and limitations of HypoPG. It is designed to assist in testing and evaluating index configurations without modifying the production database schema. Ensure that you have a clear understanding of how hypothetical indexes work and their impact on query performance.
- Use in non-production environments – HypoPG is best suited for non-production environments, such as development or staging databases. Avoid using HypoPG in a live production environment because it can introduce additional overhead and potential risks to the stability and performance of your production system.
- Thoroughly test your assumptions – Before applying any index changes based on HypoPG results, thoroughly test the proposed configurations on a representative non-production environment that closely resembles the production setup. Evaluate the performance impact on a realistic workload, using actual indexes that HypoPG has identified as being potentially beneficial, to ensure that the observed improvements hold true.
- Measure performance metrics – When using HypoPG, measure and compare performance metrics, such as runtime, query plans, and resource utilization, both with and without hypothetical indexes using Amazon CloudWatch and Amazon RDS Performance Insights. This analysis will help you accurately assess the impact of different index configurations on your specific query workload.
- Avoid exhausting your database compute resources – Be mindful of the additional overhead introduced by HypoPG’s hypothetical indexes. Running multiple HypoPG experiments concurrently or on a heavily loaded system can impact overall database performance. Take precautions to avoid overloading the system during testing and analysis.
HypoPG is a powerful PostgreSQL extension for testing and evaluating index configurations in PostgreSQL before going to the expense of building actual indices on your database. It also provides statement-level statistics to more accurately measure query run costs. Although this solution doesn’t apply to every use case, it’s an important tool to add to your toolbox, allowing you quickly evaluate how to get the most out of PostgreSQL.
We welcome your comments and feedback in the comments section.
About the Authors
Peter Celentano is a Specialist Solutions Architect with Amazon Web Services, focusing on managed PostgreSQL. He works with AWS customers to design scalable, secure, performant, and robust database architectures on the cloud.