AWS Database Blog

Index types supported in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL (GIN, GiST, HASH, BRIN)

In PostgreSQL, indexes are vital because they allow efficient access to specific data records based on the values in specified columns. PostgreSQL offers a variety of index types to address different use cases and optimize query performance. Some of the most commonly used index types are B-tree, GiST (Generalized Search Tree), GIN (Generalized Inverted Index), HASH, and BRIN (Block Range Index). Additionally, PostgreSQL’s index infrastructure is extensible, allowing developers to create custom index types tailored to their specific requirements. This extensibility is achieved through operator classes, which define the behavior of indexes for specific data types and operations.

In Part 1 of this series, we discussed the B-tree index and its various use cases in detail. In this post, we discuss other native indexes supported in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL, including GIN, GiST, HASH, and BRIN, and their use cases.

In the following sections, we discuss the strategies to apply when creating an index depending on your data retrieval needs and use case.

Prerequisites

To follow along, you should have an RDS for PostgreSQL or Aurora PostgreSQL database and a client machine with psql installed and a connection to the database. For more information, refer to Connecting to a DB instance running the PostgreSQL database engine.

Refer to the following script to generate the sample dataset. The script creates an employee table with columns of various data types (character strings, integers, dates, arrays, JSONB, and a timestamp range), and uses PostgreSQL functions such as RANDOM(), REPEAT(), and GENERATE_SERIES() to generate the random sample data that is used throughout this post.

CREATE TABLE employee ( 
      emp_name          CHARACTER VARYING COLLATE "C"
    , emp_id            BIGINT
    , emp_salary        BIGINT
    , emp_dept           CHARACTER VARYING
    , emp_doj           DATE
    , emp_lwd           DATE
    , emp_age           INTEGER
    , emp_cities        TEXT
    , emp_address_hist  JSONB
    , emp_city_list     TEXT[]
    , emp_tenure        TSRANGE
    ); 

INSERT INTO employee
SELECT
      substr(REPEAT('abexoptkhngsadl',3),((random()*(30-1)+1)::integer),5)                 AS emp_name
    , I                                                                                    AS emp_id
    , TRUNC(RANDOM() * 10000)                                                              AS emp_salary
    , UPPER(substr(REPEAT(REVERSE('abexoptkhngsadl'),3),((random()*(30-1)+1)::integer),4)) AS emp_dept
    , '1/1/2016'::DATE + ('1 YEAR'::INTERVAL * FLOOR(RANDOM()*5))                          AS emp_doj
    , '1/1/2020'::DATE + ('2 YEAR'::INTERVAL * FLOOR(RANDOM()*6))                          AS emp_lwd
    , floor(random() * (39-30+1) + 30)::int                                                AS emp_age
    , 'CITY'||I||',CITYX,'||'CITY'||I%2                                                    AS emp_cities
    , TO_JSONB(ROW(I::TEXT, 'CITY'||I||',CITYX,'||'CITY'||I%2))                            AS emp_address_hist
    , ARRAY[CONCAT_WS('','CITY',I),'CITYX',CONCAT_WS('','CITY',I%2)]                       AS emp_city_list
    , ('[2015-'||I%12||'-'||I%30||', 2017-'||I%12||'-'||I%30||')')::TSRANGE                AS emp_tenure
FROM GENERATE_SERIES(1, 5000000) AS I
WHERE i%12 BETWEEN 1 AND 13 AND i%30 BETWEEN 1 AND 28;

GIN index

A Generalized Inverted (GIN) index is commonly used when the items to be indexed are composite values and searches must find an element that appears within those composite items. It’s best suited for applications using JSONB, arrays, and full text search data types and features.

A GIN index supports various operator classes depending on the data type you are indexing, for example, jsonb_ops, gin_trgm_ops, and more. The operator class identifies the operators to be used by the index for that column. The main reason for having operator classes is that, for some data types, there can be more than one meaningful index behavior.
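As a sketch of how an operator class is named explicitly, the following creates a trigram GIN index that can speed up LIKE/ILIKE pattern matching on emp_name. The gin_trgm_ops operator class comes from the pg_trgm extension, and the index name here is illustrative:

```sql
-- gin_trgm_ops is provided by the pg_trgm extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Name the operator class after the column to override the default
-- (index name is illustrative)
CREATE INDEX idx_emp_gin_trgm ON employee USING GIN (emp_name gin_trgm_ops);
```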

Additionally, GIN indexes support only bitmap index scans, because they store individual elements of the indexed values rather than whole row values in each index page. In a bitmap index scan, the database traverses the index tree and follows the leaf node chain to find all matching entries. Once the database finds the matching entries, it sorts them according to the physical storage location of the rows in the table and fetches the rows from the table. That last part is the bitmap heap scan.

Let’s examine some use cases of when a GIN index should be used.

Full text search

A full text search is used to perform natural language searches on documents such as articles, websites, or other written formats. PostgreSQL has specific data types and built-in functions to work with full text search. You can use a GIN index to speed up your full text search. Full text search uses the built-in to_tsvector and to_tsquery PostgreSQL functions. The tsvector type is used to represent a text-searchable document, whereas the tsquery is used to represent text queries in the text query format.

The following query searches emp_cities for rows that contain both “city4” and “city0”:

EXPLAIN ANALYZE SELECT * FROM employee WHERE to_tsvector('english',emp_cities) @@ to_tsquery('english','city4 & city0');

Gather  (cost=1000.00..1203692.73 rows=217 width=199) (actual time=3.326..5881.213 rows=2 loops=1)
Workers Planned: 2
Workers Launched: 2
->  Parallel Seq Scan on employee  (cost=0.00..1202671.03 rows=90 width=199) (actual time=3903.022..5860.783 rows=1 loops=3)
Filter: (to_tsvector('english'::regconfig, emp_cities) @@ '''city4'' & ''city0'''::tsquery)
Rows Removed by Filter: 2888889
Planning Time: 0.549 ms
Execution Time: 5881.901 ms

Now we create a GIN index on the emp_cities column and rerun the same query:

CREATE INDEX idx_emp_gin_fts ON employee USING GIN(to_tsvector('english',emp_cities));

EXPLAIN ANALYZE SELECT * FROM employee WHERE to_tsvector('english',emp_cities) @@ to_tsquery('english','city4 & city0');

Bitmap Heap Scan on employee  (cost=45.68..951.64 rows=217 width=199) (actual time=1.020..1.043 rows=2 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, emp_cities) @@ '''city4'' & ''city0'''::tsquery)
Heap Blocks: exact=2
->  Bitmap Index Scan on idx_emp_gin_fts  (cost=0.00..45.63 rows=217 width=0) (actual time=0.349..0.349 rows=2 loops=1)
Index Cond: (to_tsvector('english'::regconfig, emp_cities) @@ '''city4'' & ''city0'''::tsquery)
Planning Time: 4.503 ms
Execution Time: 1.143 ms

JSONB

A GIN index is also useful to speed up querying JSON data. There are two operator classes available in PostgreSQL: jsonb_ops and jsonb_path_ops. jsonb_ops is used by default when no operator class is specified while creating a GIN index.
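To choose the non-default operator class, name it after the column in the CREATE INDEX statement. As a sketch (the index name is illustrative), jsonb_path_ops supports a smaller set of operators, notably the containment operator @>, but typically produces a smaller index for containment queries:

```sql
-- jsonb_path_ops indexes only containment-style lookups (@>),
-- trading operator coverage for a smaller, often faster index
-- (index name is illustrative)
CREATE INDEX idx_emp_gin_jsonb_path
    ON employee USING GIN (emp_address_hist jsonb_path_ops);
```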

Here, we use the previous example of an employee table, which has a city list stored in JSONB format:

EXPLAIN ANALYSE SELECT jsonb_pretty(emp_address_hist) FROM employee WHERE emp_address_hist @> '{"f2": "city4,cityX,city0"}';

Gather (cost=1000.00..300980.51 rows=867 width=32) (actual time=0.439..3188.012 rows=2 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on employee (cost=0.00..299893.81 rows=361 width=32) (actual time=1768.517..3170.380 rows=1 loops=3)
Filter: (emp_address_hist @> '{"f2": "city4,cityX,city0"}'::jsonb)
Rows Removed by Filter: 2888889
Planning Time: 0.153 ms
Execution Time: 3188.033 ms

Now we create a GIN index on the emp_address_hist column and rerun the same query:

CREATE INDEX idx_emp_gin_jsonb ON employee USING gin(emp_address_hist);

EXPLAIN ANALYSE SELECT jsonb_pretty(emp_address_hist) FROM employee WHERE emp_address_hist @> '{"f2": "city4,cityX,city0"}';

Bitmap Heap Scan on employee (cost=58.72..3384.25 rows=867 width=32) (actual time=0.480..0.496 rows=2 loops=1)
Recheck Cond: (emp_address_hist @> '{"f2": "city4,cityX,city0"}'::jsonb)
Heap Blocks: exact=2
-> Bitmap Index Scan on idx_emp_gin_jsonb (cost=0.00..58.50 rows=867 width=0) (actual time=0.159..0.159 rows=2 loops=1)
Index Cond: (emp_address_hist @> '{"f2": "city4,cityX,city0"}'::jsonb)
Planning Time: 3.478 ms
Execution Time: 0.537 ms

Arrays

PostgreSQL allows columns of a table to be defined as variable-length, multidimensional arrays. You can create arrays of any built-in or user-defined base type, enum type, composite type, range type, or domain. You can use a GIN index to speed up searching elements within the arrays.

Let’s consider the same employee table, where we have data in the form of arrays in emp_city_list. Let’s query this array to get employee details for an employee in “city5”:

EXPLAIN ANALYSE SELECT * FROM employee WHERE emp_city_list @> ARRAY['city5']::text[];

Gather (cost=1000.00..305226.21 rows=43333 width=199) (actual time=2.827..2618.174 rows=2 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on employee (cost=0.00..299892.91 rows=18055 width=199) (actual time=1569.229..2596.557 rows=1 loops=3)
Filter: (emp_city_list @> '{city5}'::text[])
Rows Removed by Filter: 2888889
Planning Time: 3.665 ms
Execution Time: 2618.269 ms

Now we create a GIN index on the emp_city_list column and rerun the same query:

CREATE INDEX idx_emp_gin_array ON employee USING gin(emp_city_list);

EXPLAIN ANALYSE SELECT * FROM employee WHERE emp_city_list @> ARRAY['city5']::text[];

Bitmap Heap Scan on employee (cost=419.83..113271.77 rows=43333 width=199) (actual time=0.639..0.673 rows=2 loops=1)
Recheck Cond: (emp_city_list @> '{city5}'::text[])
Heap Blocks: exact=2
-> Bitmap Index Scan on idx_emp_gin_array (cost=0.00..409.00 rows=43333 width=0) (actual time=0.110..0.111 rows=2 loops=1)
Index Cond: (emp_city_list @> '{city5}'::text[])
Planning Time: 7.789 ms
Execution Time: 0.833 ms

In all these cases, we saw an improvement in query execution time with a GIN index. Although GIN indexes can speed up various data select operations, data updates in a GIN index are rather slow. It’s important to understand your update frequency to determine whether a GIN index is the best approach for your workload.

GIN indexes provide a few storage parameters, like fastupdate, which you can specify during index creation or change later. This parameter tells the PostgreSQL engine to accumulate updates in a separate unordered pending list and merge them into the index only when the accumulated entries grow large enough, as defined by the gin_pending_list_limit configuration parameter.

Here are some scenarios where using fastupdate for a GIN index can be advantageous:

Frequent Updates: If your application frequently updates the indexed data, enabling fastupdate can significantly reduce the overhead of index maintenance.

Large Datasets: For large datasets where rebuilding the entire index on every update operation is impractical or too slow, fastupdate can provide performance benefits.

High Write Throughput: In scenarios with high write throughput, such as real-time data ingestion or logging systems, fastupdate can help maintain acceptable performance levels.

Mixed Workloads: If your application has a mix of read and write operations, and write operations significantly outnumber read operations, fastupdate can help ensure that write performance is not disproportionately affected by index maintenance.
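The scenarios above can be addressed with the fastupdate and gin_pending_list_limit storage parameters. A minimal sketch, with an illustrative index name and an assumed pending-list size:

```sql
-- Enable fastupdate at creation time (index name is illustrative)
CREATE INDEX idx_emp_gin_fast ON employee
    USING GIN (emp_city_list) WITH (fastupdate = on);

-- Or tune the per-index pending list limit later (value in kB;
-- 8192 here is an assumed example, not a recommendation)
ALTER INDEX idx_emp_gin_fast SET (gin_pending_list_limit = 8192);
```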

GiST index

You can use Generalized Search Tree (GiST) indexes in PostgreSQL for various types of queries and operators, such as equality operators, range queries, and partial matches. A GiST index is mainly used for range queries that involve some kind of bounds, as well as for full text search and geospatial data.

Like the GIN index, GiST also supports various operator classes to extend its uses, like gist_trgm_ops, gist__int_ops, and more.

Another important feature that GiST supports is the exclusion constraint. An exclusion constraint ensures that the given fields of any two table rows don’t conflict with each other in terms of the specified operators. It compares any two rows in the table against conditions you supply, such as “no two rows overlap.”
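As a sketch of an exclusion constraint, the following hypothetical room_booking table uses a GiST-backed EXCLUDE constraint to reject any two rows for the same room whose booking ranges overlap. The btree_gist extension is needed so GiST can handle the scalar equality on room_id:

```sql
-- btree_gist lets GiST index scalar equality (room_id WITH =)
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Hypothetical table: no two bookings for the same room may overlap
CREATE TABLE room_booking (
    room_id INT,
    booked  TSRANGE,
    EXCLUDE USING GIST (room_id WITH =, booked WITH &&)
);
```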

Full text search

As with GIN indexes, we can use a GiST index for full text searches. For this post, we use the same example as in the previous section.

In the following code, we use the employee table and create a GiST index instead of a GIN index and rerun the same query:

EXPLAIN ANALYSE SELECT * FROM employee WHERE to_tsvector('english',emp_cities) @@ to_tsquery('english','city4 & city0');

Gather (cost=1000.00..1203692.73 rows=217 width=199) (actual time=1.822..7288.673 rows=2 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on employee (cost=0.00..1202671.03 rows=90 width=199) (actual time=3631.871..7266.289 rows=1 loops=3)
Filter: (to_tsvector('english'::regconfig, emp_cities) @@ '''city4'' & ''city0'''::tsquery)
Rows Removed by Filter: 2888889
Planning Time: 3.381 ms
Execution Time: 7288.711 ms

CREATE INDEX idx_emp_gist_fts ON employee USING GIST (to_tsvector('english', emp_cities));

EXPLAIN ANALYSE SELECT * FROM employee WHERE to_tsvector('english',emp_cities) @@ to_tsquery('english','city4 & city0');

Index Scan using idx_emp_gist_fts on employee (cost=0.41..880.21 rows=217 width=199) (actual time=10.521..53.387 rows=2 loops=1)
Index Cond: (to_tsvector('english'::regconfig, emp_cities) @@ '''city4'' & ''city0'''::tsquery)
Planning Time: 5.448 ms
Execution Time: 53.483 ms

As we can see, the performance of the GiST index is comparable to the GIN index in the case of a full text search, but a GIN index is generally preferred for text search because lookups are faster. A GiST index, on the other hand, can be lossy, meaning the matches it returns must be rechecked against the actual row data. As we discussed previously, a GIN index is quite expensive to maintain when your table has frequent updates. Therefore, when choosing an index for a full text search, you should consider your use case requirements.

Range types

A GiST index is also useful for queries that deal with interval data types like tsrange. To understand this better, let’s consider the employee table and search for details for employee tenure:

EXPLAIN ANALYSE SELECT * FROM employee WHERE emp_tenure && '[2019-01-01, 2019-04-01)';

Gather (cost=1000.00..300893.01 rows=1 width=199) (actual time=1609.416..1610.704 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on employee (cost=0.00..299892.91 rows=1 width=199) (actual time=1595.718..1595.719 rows=0 loops=3)
Filter: (emp_tenure && '["2019-01-01 00:00:00","2019-04-01 00:00:00")'::tsrange)
Rows Removed by Filter: 2888890
Planning Time: 0.141 ms
Execution Time: 1610.722 ms

CREATE index idx_emp_gist_tenure_range on employee USING GIST(emp_tenure);

EXPLAIN ANALYSE SELECT * FROM employee WHERE emp_tenure && '[2019-01-01, 2019-04-01)';

Index Scan using idx_emp_gist_tenure_range on employee (cost=0.41..8.43 rows=1 width=199) (actual time=0.136..0.137 rows=0 loops=1)
Index Cond: (emp_tenure && '["2019-01-01 00:00:00","2019-04-01 00:00:00")'::tsrange)
Planning Time: 5.839 ms
Execution Time: 0.193 ms

HASH index

A HASH index stores a 32-bit hash code derived from the value of the indexed column, so it can represent 2³² distinct hash codes. You should consider using a HASH index only if the indexed column is queried with the equality (=) operator. This is a common alternative to a B-tree index for search conditions where only the = operator is used.

A HASH index stores only 4-byte hash values instead of actual column values, so it’s much smaller than a B-tree index.
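You can verify the size difference yourself. A minimal sketch, assuming a B-tree and a HASH index already exist on the same column (both index names are illustrative):

```sql
-- Compare on-disk sizes of two existing indexes
SELECT pg_size_pretty(pg_relation_size('idx_emp_id_btree')) AS btree_size,
       pg_size_pretty(pg_relation_size('idx_emp_id_hash'))  AS hash_size;
```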

There are many use cases for a HASH index, such as equality lookups (finding rows with a specific value in an indexed column) and join operations that probe the joined column with equality conditions, among others. Note that HASH indexes can’t enforce unique constraints; PostgreSQL supports unique indexes only with the B-tree access method.

To illustrate the HASH index with an example, we search for an element in an array. In PostgreSQL, you can create an index directly on an array column, and if the application mostly uses the = operator, the HASH index performs better than GIN or B-tree indexes.

For demonstration purposes, we create a composite type array and create B-tree, GIN, and HASH indexes. We examine the cost for each index type for an = operator on the composite type:

CREATE TYPE comp_type AS (id1 int, id2 int);
CREATE TABLE idx_demo_table (sn INT, ids comp_type[]);
INSERT INTO idx_demo_table SELECT i, ARRAY[(i,i)::comp_type]
FROM generate_series(1,1000000)i;

EXPLAIN ANALYSE SELECT * FROM idx_demo_table WHERE ids = ARRAY[(15,15)::comp_type];

Gather (cost=1000.00..17572.43 rows=1 width=57) (actual time=0.733..84.967 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on idx_demo_table (cost=0.00..16572.33 rows=1 width=57) (actual time=43.942..71.519 rows=0 loops=3)
Filter: (ids = '{"(15,15)"}'::comp_type[])
Rows Removed by Filter: 333333
Planning Time: 0.219 ms
Execution Time: 84.982 ms

CREATE INDEX idx_demo_table_btree ON idx_demo_table USING BTREE(ids);

EXPLAIN ANALYSE SELECT * FROM idx_demo_table WHERE ids = ARRAY[(15,15)::comp_type];

Index Scan using idx_demo_table_btree on idx_demo_table (cost=0.55..8.57 rows=1 width=57) (actual time=0.106..0.110 rows=1 loops=1)
Index Cond: (ids = '{"(15,15)"}'::comp_type[])
Planning Time: 1.252 ms
Execution Time: 0.145 ms

DROP INDEX idx_demo_table_btree;

CREATE INDEX idx_demo_table_gin ON idx_demo_table USING GIN(ids);

EXPLAIN ANALYSE SELECT * FROM idx_demo_table WHERE ids = ARRAY[(15,15)::comp_type];

Bitmap Heap Scan on idx_demo_table (cost=20.01..24.02 rows=1 width=57) (actual time=0.083..0.085 rows=1 loops=1)
Recheck Cond: (ids = '{"(15,15)"}'::comp_type[])
Heap Blocks: exact=1
-> Bitmap Index Scan on idx_demo_table_gin (cost=0.00..20.01 rows=1 width=0) (actual time=0.068..0.068 rows=1 loops=1)
Index Cond: (ids = '{"(15,15)"}'::comp_type[])
Planning Time: 0.974 ms
Execution Time: 0.155 ms

DROP INDEX idx_demo_table_gin;

CREATE INDEX idx_demo_table_hash ON idx_demo_table USING HASH(ids);

EXPLAIN ANALYSE SELECT * FROM idx_demo_table WHERE ids = ARRAY[(15,15)::comp_type];

Index Scan using idx_demo_table_hash on idx_demo_table (cost=0.00..8.02 rows=1 width=57) (actual time=0.033..0.119 rows=1 loops=1)
Index Cond: (ids = '{"(15,15)"}'::comp_type[])
Planning Time: 0.140 ms
Execution Time: 0.142 ms

As we can observe in this example, the planner cost for a HASH index is 8.02, which is better when compared to the B-tree (8.57) and GIN (24.02) indexes.

HASH indexes may perform better in use cases where the array is compared as a whole record rather than over a range of records, but they might not be suitable for tables with a rapidly increasing number of rows. Additionally, duplicate data in a table can reduce HASH index performance.

BRIN (min max) index

A Block Range Index (BRIN) works on a block range (page range) of data, that is, a group of pages lying physically adjacent on disk for a table. For each block range, the index stores summary information: the minimum and maximum value for that range of adjacent table pages.

BRIN is typically used for time series data and other naturally ordered datasets, where data types such as integers and dates have a linear sort order and can be summarized by the min and max values in each range. BRIN is incredibly helpful for efficiently searching over large time series data and has the benefit of taking up significantly less space on disk than a standard B-tree index. However, scanning the index returns more rows than you asked for, so the database must filter out these additional rows as an extra step when running your query.

For example, if we consider emp_id, which is populated as a sequential number by the generate_series() function, and build a BRIN index on it, the planner can use a BRIN index scan. The runtime of a B-tree index lookup will still be lower than that of the BRIN index (3.852 milliseconds in the following example), but the BRIN index (72 kB) is far smaller than the B-tree index (93 MB), as the index listing that follows shows.

CREATE INDEX idx_emp_id_brin ON employee USING brin(emp_id);

EXPLAIN ANALYZE select * from employee WHERE emp_id > 444 and emp_id < 555;

Bitmap Heap Scan on employee (cost=24.03..14849.30 rows=1 width=199) (actual time=2.192..3.805 rows=95 loops=1)
Recheck Cond: ((emp_id > 444) AND (emp_id < 555))
Rows Removed by Index Recheck: 4432
Heap Blocks: lossy=128
-> Bitmap Index Scan on idx_emp_id_brin (cost=0.00..24.03 rows=4351 width=0) (actual time=0.297..0.297 rows=1280 loops=1)
Index Cond: ((emp_id > 444) AND (emp_id < 555))
Planning Time: 0.695 ms
Execution Time: 3.852 ms

Name            | Type  | Owner    | Table    | Persistence | Access method | Size
----------------+-------+----------+----------+-------------+---------------+------
idx_emp_id      | index | postgres | employee | permanent   | btree         | 93 MB
idx_emp_id_brin | index | postgres | employee | permanent   | brin          | 72 kB
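A BRIN index’s granularity is tunable through the pages_per_range storage parameter (the default is 128 pages per range). A smaller value yields a more precise but slightly larger index, which can reduce the number of lossy heap blocks that must be rechecked. A minimal sketch, with an illustrative index name and an assumed range size:

```sql
-- Smaller block ranges -> more summary entries, fewer lossy heap blocks
-- (32 is an example value; the default pages_per_range is 128)
CREATE INDEX idx_emp_id_brin_small ON employee
    USING BRIN (emp_id) WITH (pages_per_range = 32);
```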

Clean up

Complete the following steps to clean up the resources and data you created during this post:

  1. Delete your DB cluster or DB instance.
  2. Delete the sample data with the following code:
DROP TABLE employee;
DROP TABLE idx_demo_table;
DROP TYPE comp_type;

Conclusion

In this post, we discussed various use cases for GIN, GiST, HASH, and BRIN indexes supported in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL.

If you have similar use cases in your application or workload, you can refer to this post and implement similar solutions.

In the upcoming Part 3 of this series, we discuss indexes supported through database extensions like btree_gin, btree_gist, bloom, and SP-GiST.

Leave any thoughts or questions in the comments section.


About the Authors

Sachin Khanna is a Lead Database Consultant on the AWS Professional Services team at AWS. He has extensive experience in relational databases with a passion for Amazon Aurora PostgreSQL. He is a consultant with expertise on performance and optimization. He carries rich experience in database migrations and cost optimization, and has helped customers in their cloud migration journey.

Rajkumar Raghuwanshi is a Database Consultant with AWS Professional Services based out of Pune, India. With a good knowledge of relational databases and hands-on experience in homogenous and heterogenous database migrations, he helps customers migrate to the AWS Cloud and optimize performance.

Sachin Kotwal is a Lead Database Consultant with AWS Professional Services based out of Pune, India. He has more than a decade of experience in relational database administration and architecture. He has worked on many homogenous database migrations and performance optimizations.