AWS Database Blog
Index types supported in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL using extensions (Bloom, pg_trgm, and pg_bigm)
Do you need to perform fuzzy text searches, find similar strings across large datasets, or efficiently filter across many columns without creating dozens of individual indexes? The Bloom, pg_trgm, and pg_bigm extensions for PostgreSQL address these exact challenges.
In Part 1, Part 2, and Part 3 of this series, we explored PostgreSQL’s native indexes (B-tree, GIN, GiST, HASH, BRIN) and specialized extension-based index types (SP-GiST, btree_gin, btree_gist). In this post, we dive into three additional extensions available in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL:
- Bloom – for space-efficient multi-column equality filtering
pg_trgm– for fuzzy text matching and similarity searchespg_bigm– for full-text search optimized for Asian languages (such as Japanese, Chinese, and Korean)
These extensions use the PostgreSQL extensible index infrastructure to give you specialized indexing strategies for text search and multi-column filtering scenarios that native indexes don’t efficiently address.
Prerequisites
We use SQL examples throughout this post to demonstrate concepts. If you want to reproduce the examples in your environment, you need the following:
- An Amazon RDS for PostgreSQL or Aurora PostgreSQL database instance
- A client machine with psql installed and a connection to the database
For more information, refer to Connecting to a DB instance running the PostgreSQL database engine.
Refer to the following script to generate the sample dataset. This script generates a sample dataset for a products_infor table with various columns containing different data types like character strings, integers, dates, arrays, and JSONB. It uses PostgreSQL functions like RANDOM(), REPEAT(), and GENERATE_SERIES() to generate random sample data that we use throughout this post.
Bloom index
Bloom indexes are based on Bloom filters. A Bloom filter is a data structure designed to maximize efficiency in both speed and storage space. It excels at quickly determining when elements aren’t present in a dataset, making it particularly valuable for data skipping and rapid set membership testing. Although a Bloom filter doesn’t produce false negatives, it might produce false positives (indicating an element exists when it doesn’t), thereby requiring additional verification. Index search results must always be rechecked using the actual attribute values from the heap entry.
- Bloom filters and indexes are recommended for use cases involving numerous columns and frequent queries with random combinations of equality operators (=).
- By merging multiple column values into a single index entry, Bloom indexes provide a space-efficient solution, potentially replacing multiple B-tree indexes with a single, lossy index that requires rechecking matched values for accuracy.
For example, let’s consider searching any product in the products_infor table on name, category, and description attributes:
Let’s create a Bloom index on these fields and observe how this can be helpful in searching tables across multiple attributes with an equality operator. First, let’s create the Bloom extension to support Bloom index creation:
Tuning Bloom index accuracy
You can increase the accuracy of bloom filtering by tuning parameters when creating the index. These parameters control the trade-off between scan accuracy and index size:
- length – Signature length in bits (default: 80, max: 4096), rounded up to the nearest multiple of 16. Larger values reduce false positives but increase index size.
- col1-col32 – Number of bits generated for each index column (default: 2, max: 4095). Allocate more bits to columns with higher cardinality or more unique values.
Choosing parameter values:
- Analyze the cardinality of your data. Columns with more distinct values benefit from higher bit allocation.
- Larger text fields with high variability should receive more bits. For example, in a multi-column index where
namehas high uniqueness anddescriptioncontains long variable text, allocate more bits to these columns compared to a low-cardinalitycategoryfield. - Balance accuracy improvements against the resulting index size increase.
Considerations when using Bloom indexes
While Bloom indexes excel at multi-column equality filtering, they come with several important trade-offs to consider. First, their probabilistic nature means they can return false positives—rows that match the filter but don’t actually satisfy your query conditions. PostgreSQL must still verify each candidate row, adding overhead to query execution.
Bloom indexes are also limited in their query support. They only work with equality operators (=) and can’t assist with range queries (<, >, BETWEEN) or ordering operations (ORDER BY). If your queries require sorting or range filtering, you must rely on other index types or sequential scans for those operations.
Bloom indexes can’t enforce UNIQUE constraints and have higher write overhead compared to B-tree indexes.
pg_trgm
PostgreSQL’s pg_trgm extension transforms traditional exact-match searches into error-tolerant queries by breaking text into three-character segments (trigrams). It provides functions and operators for determining the similarity of text based on trigram matching. A trigram is a group of three consecutive characters taken from a string. When determining the set of trigrams in a string, PostgreSQL prefixes two spaces and suffixes one space to the string. For example, the set of trigrams in the word “cat” is ” c”, ” ca”, “cat”, “at”:
For a practical implementation of pg_trgm, see Building a job search engine with PostgreSQL’s advanced search features.
This feature excels at finding similar strings even when users make typing mistakes or use spelling variations.
With pg_trgm, your database can quickly identify text patterns that are close matches rather than requiring perfect matches. The extension is particularly valuable for applications such as autocomplete suggestions, text search systems, or document management systems. It offers the following capabilities:
- Text similarity measurement – It calculates and returns a numerical value between 0 (completely different) and 1 (identical) for finding how similar the two strings are. For example,
similarity('hello', 'helo')returns 0.57, meaning “hello” and “helo” are moderate similarity (approximately 57%). - Document searching – This is also helpful in building text search applications. For example, the function
word_similarity('word', 'two words')looks for a word within a longer phrase. It’s useful when you want to find a word that might be part of a longer text. The score of 0.8 means “word” is very similar to part of “two words.” - Fuzzy string matching – This enables “did you mean?” type searches. It’s helpful for handling typographical errors (“recieve” finds “receive”) in search queries or alternative spellings (“color” matches “colour”) or misspelled words (“elefant” finds “elephant”).
Tuning text search accuracy
You can tune the text search accuracy and performance using threshold parameters that control how PostgreSQL matches similar text. You can set these parameters at the session level or globally, and they determine the balance between finding more matches (recall) and finding only relevant matches (precision).
The following thresholds make it possible to fine-tune your text searches:
pg_trgm.similarity_threshold(default 0.3) – Controls how similar two entire strings must be for the%operator to consider them a match. Lower values such as 0.2 cast a wider net but might include false positives, while higher values such as 0.5 provide more precise matches but might miss relevant results.pg_trgm.word_similarity_threshold(default 0.6) – Used with<%and%>operators for finding word similarities within longer strings. This is useful when searching for a word that might be part of a longer text.pg_trgm.strict_word_similarity_threshold(default 0.5) – Used with<<%and%>>operators for stricter whole-word matching. This provides the most precise matching, typically requiring near-exact word matches.
pg_trgm index support
The pg_trgm extension adds support for additional operator classes that you can use with GIN and GiST indexes. These operator classes support fast similarity searches and pattern matching operations on text columns. For more details on GiST and GIN indexes, refer to Part 2 of this series.
The following operator classes are supported:
gin_trgm_ops– This operator class is designed for GIN indexes. It’s particularly efficient for large datasets and supports a wide range of text search operations. Withgin_trgm_ops, you can perform fastLIKE,ILIKE, and regular expression searches. It’s excellent for scenarios where you need to find similar strings or implement fuzzy matching, even with large amounts of data.gist_trgm_ops– This operator class is used with GiST indexes. Although it supports similar operations togin_trgm_ops, it has different performance characteristics. GiST indexes are generally faster to build and update, but can be slower for searches compared to GIN indexes, especially for larger datasets.
These indexes don’t support equality or simple comparison operators.
Example 1: Using a GiST index
The following code illustrates the example using a GiST index with the gist_trgm_ops operator class:
The execution plan shows that PostgreSQL uses the GiST trigram index (idx_products_desc_gist) to efficiently filter the description column with the ILIKE pattern match. The index condition is evaluated within the index scan itself, resulting in fast execution (6961ms).
Example 2: Using a GIN index
The following code illustrates the example using a GIN index with the gin_trgm_ops operator class:
The execution plan shows that PostgreSQL uses the GIN trigram index (idx_gin_trgm_idx) to efficiently filter the category column with the ILIKE pattern match. The index uses a Bitmap Index Scan followed by a Bitmap Heap Scan, which is typical for GIN indexes. The execution time is 2811.02 ms.
The pg_trgm extension has some key limitations:
- Index size – Trigram indexes can be significantly larger than regular B-tree indexes. GIN indexes especially can consume 3–4 times more space than the original text data.
- Memory usage – High memory usage during index creation and complex queries.
- Short string performance – Not optimal for very short strings (fewer than three characters). For columns storing short values such as category codes or status flags, consider using B-tree indexes instead.
- Long string performance – Search performance and index creation time degrade with very long strings, as the number of trigrams grows linearly with string length.
- Threshold sensitivity – Full table scans might occur if you set the similarity threshold too low. Tune this value carefully based on your use case.
- Match accuracy – Similarity matches might return results that appear textually similar but are semantically unrelated (false positives). Conversely, relevant matches might be missed if the text’s character composition differs significantly from the search term.
pg_bigm
The pg_bigm extension enhances full-text search capabilities in PostgreSQL for languages that don’t use whitespace to separate words, such as Japanese, Chinese, and Korean. The pg_bigm module provides full text search capability for Asian languages that use multi-byte character sets like Chinese, Korean, and Japanese. With this module, you can create a 2-gram (bigram) index for faster full text search.
You can create a GIN index on bigrams of strings and use several functions for text searching and similarity calculation.
The gin_bigm_ops operator class is designed for GIN indexes specifically for handling Asian languages, particularly Japanese text, using bigrams (2-character sequences). It supports efficient LIKE pattern matching operations and supports similarity searches for non-whitespace delimited languages.
To create the extension pg_bigm, use the following code:
We use the following code to create sample data for the Japanese language:
Let’s try to find a book with a given title using the LIKE operator:
Now let’s create a GIN index with the gin_bigm_ops operator:
The pg_bigm extension has some key limitations:
- Index size can grow rapidly – Bigram indexes are usually larger than trigram indexes on ASCII data of similar length. They can be up to 5–10x bigger than the column data.
- No support for phrase search –
pg_bigmdoesn’t support phrase queries or proximity search. - Limited language awareness – No stemming, stopwords, synonyms—bigrams are raw 2-character pairs.
Clean up
Complete the following steps to clean up the resources and data that you might have created in this post to follow along:
- Delete the sample data with the following code:
- Delete your DB cluster or DB instance.
Conclusion
In this post, we discussed specialized text search and filtering indexing techniques supported in Aurora PostgreSQL and Amazon RDS for PostgreSQL through extensions. Bloom indexes provide space-efficient multi-column equality filtering, pg_trgm enables fuzzy text matching and similarity searches, and pg_bigm offers full-text search capabilities optimized for Asian languages.
Throughout this four-part series, we explored the full spectrum of indexing capabilities available in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL:
- Part 1 covered B-tree indexes—the default and most widely used index type, ideal for equality and range queries on ordered data.
- Part 2 expanded into GIN, GiST, HASH, and BRIN indexes, addressing use cases such as full-text search, geometric data, exact-match lookups, and large sequential datasets.
- Part 3 introduced extension-based indexes including SP-GiST, btree_gin, and btree_gist, enabling efficient spatial searches, unified indexing across mixed column types, and exclusion constraints for high-update workloads.
- Part 4 completed the series with Bloom,
pg_trgm, andpg_bigmextensions for multi-column filtering, fuzzy matching, and multilingual text search.
Together, these indexing strategies provide a comprehensive toolkit for optimizing query performance across diverse workloads. When implementing any of these indexes, analyze the performance impact on your overall workload and test thoroughly before production deployment.
If you have similar use cases in your application or workload, you can refer to this series and implement similar solutions to improve query performance, reduce storage overhead, and enhance the search capabilities of your PostgreSQL databases. You can use Amazon Aurora Fast Database Cloning to create test environments for evaluating index strategies without affecting production systems.
Leave your thoughts or questions in the comments section.