AWS Database Blog

Index types supported in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL using extensions (Bloom, pg_trgm, and pg_bigm)

Do you need to perform fuzzy text searches, find similar strings across large datasets, or efficiently filter across many columns without creating dozens of individual indexes? The Bloom, pg_trgm, and pg_bigm extensions for PostgreSQL address these exact challenges.

In Part 1, Part 2, and Part 3 of this series, we explored PostgreSQL’s native indexes (B-tree, GIN, GiST, HASH, BRIN) and specialized extension-based index types (SP-GiST, btree_gin, btree_gist). In this post, we dive into three additional extensions available in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL:

  • Bloom – for space-efficient multi-column equality filtering
  • pg_trgm – for fuzzy text matching and similarity searches
  • pg_bigm – for full-text search optimized for Asian languages (such as Japanese, Chinese, and Korean)

These extensions use the PostgreSQL extensible index infrastructure to give you specialized indexing strategies for text search and multi-column filtering scenarios that native indexes don’t efficiently address.

Prerequisites

We use SQL examples throughout this post to demonstrate concepts. If you want to reproduce the examples in your environment, you need the following:

  • An Amazon RDS for PostgreSQL or Aurora PostgreSQL database instance
  • A client machine with psql installed and a connection to the database

For more information, refer to Connecting to a DB instance running the PostgreSQL database engine.

Refer to the following script to generate the sample dataset. This script generates a sample dataset for a products_infor table with various columns containing different data types like character strings, integers, dates, arrays, and JSONB. It uses PostgreSQL functions like RANDOM(), REPEAT(), and GENERATE_SERIES() to generate random sample data that we use throughout this post.

CREATE TABLE products_infor(
    id SERIAL PRIMARY KEY,
    name TEXT,
    category TEXT,
    description TEXT
);

INSERT INTO products_infor (
    name,
    category,
    description
)
SELECT
    (ARRAY[
        'Wireless Bluetooth Headphones', 'Ergonomic Office Chair',
        'Stainless Steel Water Bottle','Organic Cotton T-Shirt',
        'Portable Power Bank 20000mAh', 'Bamboo Cutting Board Set',
        'Noise Cancelling Earbuds', 'Memory Foam Pillow',
        'Cast Iron Skillet 12 inch','Yoga Mat Non-Slip Premium',
        'LED Desk Lamp Adjustable', 'Ceramic Coffee Mug Set',
        'Running Shoes Lightweight', 'Insulated Lunch Bag',
        'Stainless Steel Thermos', 'Organic Green Tea Collection',
        'Smart Watch Fitness Tracker', 'Bamboo Laptop Stand Adjustable',
        'Ultra Slim Power Adapter', 'Premium Leather Wallet',
        'Recycled Polyester Jacket', 'Solar Powered Charger',
        'Digital Kitchen Scale', 'Aromatherapy Essential Oils',
        'Copper Bottom Cookware Set'
    ])[1 + (i % 30)] AS name,
    (ARRAY[
        'Electronics', 'Furniture', 'Kitchen', 'Clothing',
        'Sports', 'Health', 'Accessories', 'Outdoor'
    ])[1 + (i % 8)] AS category,
    (ARRAY[
        'Wireless bluetooth headphones with active noise cancellation',
        'Ergonomic office chair with adjustable lumbar support design',
        'Vacuum insulated stainless steel water bottle, cold 24 hours',
        'Soft organic cotton t-shirt made from GOTS certified cotton',
        'Portable power bank 20000mAh with dual USB-C fast charging',
        'Premium bamboo cutting board set with juice grooves handles',
        'True wireless earbuds with hybrid noise cancellation IPX5',
        'Memory foam pillow with cooling gel layer for regulation',
        'Pre-seasoned cast iron skillet with superior heat retention',
        'Non-slip yoga mat 6mm thick with alignment lines and strap',
        'Modern LED desk lamp with 5 color temperatures adjustable',
        'Handcrafted ceramic coffee mug set with unique glazed finish',
        'Lightweight running shoes with responsive cushioning design',
        'Insulated lunch bag with multiple compartments leak-proof',
        'Mechanical gaming keyboard with hot-swappable RGB switches',
        'Stainless steel thermos with one-touch lid and tea infuser',
        'Premium organic green tea collection from sustainable farms',
        'Waterproof hiking boots with Vibram outsole Gore-Tex lining',
        'Fitness tracker smartwatch with heart rate monitor and GPS',
        'Bamboo laptop stand adjustable with ventilation for cooling'
    ])[1 + (i % 20)] AS description
FROM generate_series(1, 100000) i;

Bloom index

Bloom indexes are based on Bloom filters. A Bloom filter is a data structure designed to maximize efficiency in both speed and storage space. It excels at quickly determining when elements aren’t present in a dataset, making it particularly valuable for data skipping and rapid set membership testing. Although a Bloom filter doesn’t produce false negatives, it might produce false positives (indicating an element exists when it doesn’t), thereby requiring additional verification. Index search results must always be rechecked using the actual attribute values from the heap entry.

  • Bloom filters and indexes are recommended for use cases involving numerous columns and frequent queries with random combinations of equality operators (=).
  • By merging multiple column values into a single index entry, Bloom indexes provide a space-efficient solution, potentially replacing multiple B-tree indexes with a single, lossy index that requires rechecking matched values for accuracy.

For example, let’s consider searching any product in the products_infor table on name, category, and description attributes:

EXPLAIN ANALYZE SELECT * FROM products_infor where name='Ergonomic Office Chair'and category='furniture' and description='Wireless bluetooth headphones with active noise cancellation';

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..254065.85 rows=1 width=100) (actual time=623.995..625.457 rows=0.00 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Buffers: shared hit=16115 read=137395
   ->  Parallel Seq Scan on products_infor  (cost=0.00..253065.75 rows=1 width=100) (actual time=600.442..600.442 rows=0.00 loops=2)
         Filter: ((name = 'Ergonomic Office Chair'::text) AND (category = ' furniture'::text) AND (description = ' Wireless bluetooth headphones with active noise cancellation'::text))
         Rows Removed by Filter: 5000000
         Buffers: shared hit=16115 read=137395
 Planning Time: 0.402 ms
 Execution Time: 625.504 ms

Let’s create a Bloom index on these fields and observe how this can be helpful in searching tables across multiple attributes with an equality operator. First, let’s create the Bloom extension to support Bloom index creation:

CREATE EXTENSION bloom;

CREATE INDEX idx_product_bloom ON products_infor USING bloom (category, name, description);

EXPLAIN ANALYZE SELECT * FROM products_infor where name='Ergonomic Office Chair' and category='furniture' and description='Wireless bluetooth headphones with active noise cancellation';

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on products_infor  (cost=203436.00..203440.02 rows=1 width=100) (actual time=37.552..37.553 rows=0.00 loops=1)
   Recheck Cond: ((category = ' furniture'::text) AND (name = 'Ergonomic Office Chair'::text) AND (description = ' Wireless bluetooth headphones with active noise cancellation'::text))
   Buffers: shared hit=16109 read=3499
   ->  Bitmap Index Scan on idx_product_bloom  (cost=0.00..203436.00 rows=1 width=0) (actual time=37.540..37.540 rows=0.00 loops=1)
         Index Cond: ((category = ' furniture'::text) AND (name = 'Ergonomic Office Chair'::text) AND (description = ' Wireless bluetooth headphones with active noise cancellation'::text))
         Index Searches: 1
         Buffers: shared hit=16109 read=3499
 Planning:
   Buffers: shared hit=20 read=1
 Planning Time: 4.253 ms
 Execution Time: 38.461 ms

Tuning Bloom index accuracy

You can increase the accuracy of bloom filtering by tuning parameters when creating the index. These parameters control the trade-off between scan accuracy and index size:

  • length – Signature length in bits (default: 80, max: 4096), rounded up to the nearest multiple of 16. Larger values reduce false positives but increase index size.
  • col1-col32 – Number of bits generated for each index column (default: 2, max: 4095). Allocate more bits to columns with higher cardinality or more unique values.

Choosing parameter values:

  • Analyze the cardinality of your data. Columns with more distinct values benefit from higher bit allocation.
  • Larger text fields with high variability should receive more bits. For example, in a multi-column index where name has high uniqueness and description contains long variable text, allocate more bits to these columns compared to a low-cardinality category field.
  • Balance accuracy improvements against the resulting index size increase.

Considerations when using Bloom indexes

While Bloom indexes excel at multi-column equality filtering, they come with several important trade-offs to consider. First, their probabilistic nature means they can return false positives—rows that match the filter but don’t actually satisfy your query conditions. PostgreSQL must still verify each candidate row, adding overhead to query execution.

Bloom indexes are also limited in their query support. They only work with equality operators (=) and can’t assist with range queries (<, >, BETWEEN) or ordering operations (ORDER BY). If your queries require sorting or range filtering, you must rely on other index types or sequential scans for those operations.

Bloom indexes can’t enforce UNIQUE constraints and have higher write overhead compared to B-tree indexes.

pg_trgm

PostgreSQL’s pg_trgm extension transforms traditional exact-match searches into error-tolerant queries by breaking text into three-character segments (trigrams). It provides functions and operators for determining the similarity of text based on trigram matching. A trigram is a group of three consecutive characters taken from a string. When determining the set of trigrams in a string, PostgreSQL prefixes two spaces and suffixes one space to the string. For example, the set of trigrams in the word “cat” is ” c”, ” ca”, “cat”, “at”:

For a practical implementation of pg_trgm, see Building a job search engine with PostgreSQL’s advanced search features.

SELECT show_trgm('cat');
                show_trgm
-----------------------------------------
 {"  c"," ca","at ",cat}
(1 row)

This feature excels at finding similar strings even when users make typing mistakes or use spelling variations.

With pg_trgm, your database can quickly identify text patterns that are close matches rather than requiring perfect matches. The extension is particularly valuable for applications such as autocomplete suggestions, text search systems, or document management systems. It offers the following capabilities:

  • Text similarity measurement – It calculates and returns a numerical value between 0 (completely different) and 1 (identical) for finding how similar the two strings are. For example, similarity('hello', 'helo') returns 0.57, meaning “hello” and “helo” are moderate similarity (approximately 57%).
  • Document searching – This is also helpful in building text search applications. For example, the function word_similarity('word', 'two words') looks for a word within a longer phrase. It’s useful when you want to find a word that might be part of a longer text. The score of 0.8 means “word” is very similar to part of “two words.”
  • Fuzzy string matching – This enables “did you mean?” type searches. It’s helpful for handling typographical errors (“recieve” finds “receive”) in search queries or alternative spellings (“color” matches “colour”) or misspelled words (“elefant” finds “elephant”).

Tuning text search accuracy

You can tune the text search accuracy and performance using threshold parameters that control how PostgreSQL matches similar text. You can set these parameters at the session level or globally, and they determine the balance between finding more matches (recall) and finding only relevant matches (precision).

The following thresholds make it possible to fine-tune your text searches:

  • pg_trgm.similarity_threshold (default 0.3) – Controls how similar two entire strings must be for the % operator to consider them a match. Lower values such as 0.2 cast a wider net but might include false positives, while higher values such as 0.5 provide more precise matches but might miss relevant results.
  • pg_trgm.word_similarity_threshold (default 0.6) – Used with <% and %> operators for finding word similarities within longer strings. This is useful when searching for a word that might be part of a longer text.
  • pg_trgm.strict_word_similarity_threshold (default 0.5) – Used with <<% and %>> operators for stricter whole-word matching. This provides the most precise matching, typically requiring near-exact word matches.

pg_trgm index support

The pg_trgm extension adds support for additional operator classes that you can use with GIN and GiST indexes. These operator classes support fast similarity searches and pattern matching operations on text columns. For more details on GiST and GIN indexes, refer to Part 2 of this series.

The following operator classes are supported:

  • gin_trgm_ops – This operator class is designed for GIN indexes. It’s particularly efficient for large datasets and supports a wide range of text search operations. With gin_trgm_ops, you can perform fast LIKE, ILIKE, and regular expression searches. It’s excellent for scenarios where you need to find similar strings or implement fuzzy matching, even with large amounts of data.
  • gist_trgm_ops – This operator class is used with GiST indexes. Although it supports similar operations to gin_trgm_ops, it has different performance characteristics. GiST indexes are generally faster to build and update, but can be slower for searches compared to GIN indexes, especially for larger datasets.

These indexes don’t support equality or simple comparison operators.

CREATE EXTENSION pg_trgm;

Example 1: Using a GiST index

The following code illustrates the example using a GiST index with the gist_trgm_ops operator class:

CREATE INDEX idx_products_desc_gist ON products_infor USING gist(description gist_trgm_ops);

EXPLAIN ANALYZE
SELECT count(1)
FROM products_infor
WHERE description ILIKE '%green tea%';

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3954.12..3954.13 rows=1 width=8) (actual time=6961.013..6961.014 rows=1.00 loops=1)
   Buffers: shared hit=6209 read=176211 written=2340
   ->  Bitmap Heap Scan on products_infor  (cost=192.17..3951.62 rows=1000 width=0) (actual time=3897.966..6947.678 rows=500000.00 loops=1)
         Recheck Cond: (description ~~* '%green tea%'::text)
         Rows Removed by Index Recheck: 6139644
         Heap Blocks: exact=54280 lossy=99230
         Buffers: shared hit=6209 read=176211 written=2340
         ->  Bitmap Index Scan on idx_products_desc_gist  (cost=0.00..191.92 rows=1000 width=0) (actual time=3891.659..3891.659 rows=500000.00 loops=1)
               Index Cond: (description ~~* '%green tea%'::text)
               Index Searches: 1
               Buffers: shared hit=6209 read=22701 written=2340
 Planning Time: 0.277 ms
 Execution Time: 6961.108 ms

The execution plan shows that PostgreSQL uses the GiST trigram index (idx_products_desc_gist) to efficiently filter the description column with the ILIKE pattern match. The index condition is evaluated within the index scan itself, resulting in fast execution (6961ms).

Example 2: Using a GIN index

The following code illustrates the example using a GIN index with the gin_trgm_ops operator class:

CREATE INDEX idx_gin_trgm_idx ON products_infor USING gin(category gin_trgm_ops);

EXPLAIN ANALYZE
SELECT category, description
FROM products_infor
WHERE category ILIKE '%Sports%';

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on products_infor  (cost=9720.40..264944.75 rows=1273000 width=68) (actual time=153.711..2783.331 rows=1250000.00 loops=1)
   Recheck Cond: (category ~~* '%Sports%'::text)
   Rows Removed by Index Recheck: 5632606
   Heap Blocks: exact=54690 lossy=98820
   Buffers: shared hit=1283 read=153507 written=12899
   ->  Bitmap Index Scan on idx_gin_trgm_idx  (cost=0.00..9402.15 rows=1273000 width=0) (actual time=148.011..148.011 rows=1250000.00 loops=1)
         Index Cond: (category ~~* '%Sports%'::text)
         Index Searches: 1
         Buffers: shared hit=1280
 Planning:
   Buffers: shared hit=16
 Planning Time: 1.689 ms
 Execution Time: 2811.020 ms

The execution plan shows that PostgreSQL uses the GIN trigram index (idx_gin_trgm_idx) to efficiently filter the category column with the ILIKE pattern match. The index uses a Bitmap Index Scan followed by a Bitmap Heap Scan, which is typical for GIN indexes. The execution time is 2811.02 ms.

The pg_trgm extension has some key limitations:

  • Index size – Trigram indexes can be significantly larger than regular B-tree indexes. GIN indexes especially can consume 3–4 times more space than the original text data.
  • Memory usage – High memory usage during index creation and complex queries.
  • Short string performance – Not optimal for very short strings (fewer than three characters). For columns storing short values such as category codes or status flags, consider using B-tree indexes instead.
  • Long string performance – Search performance and index creation time degrade with very long strings, as the number of trigrams grows linearly with string length.
  • Threshold sensitivity – Full table scans might occur if you set the similarity threshold too low. Tune this value carefully based on your use case.
  • Match accuracy – Similarity matches might return results that appear textually similar but are semantically unrelated (false positives). Conversely, relevant matches might be missed if the text’s character composition differs significantly from the search term.

pg_bigm

The pg_bigm extension enhances full-text search capabilities in PostgreSQL for languages that don’t use whitespace to separate words, such as Japanese, Chinese, and Korean. The pg_bigm module provides full text search capability for Asian languages that use multi-byte character sets like Chinese, Korean, and Japanese. With this module, you can create a 2-gram (bigram) index for faster full text search.

You can create a GIN index on bigrams of strings and use several functions for text searching and similarity calculation.

The gin_bigm_ops operator class is designed for GIN indexes specifically for handling Asian languages, particularly Japanese text, using bigrams (2-character sequences). It supports efficient LIKE pattern matching operations and supports similarity searches for non-whitespace delimited languages.

To create the extension pg_bigm, use the following code:

CREATE EXTENSION pg_bigm;

We use the following code to create sample data for the Japanese language:

CREATE TABLE japanese_books (
    id SERIAL PRIMARY KEY,
    title TEXT,
    author TEXT,
    description TEXT
);

WITH RECURSIVE
japanese_titles AS (
    SELECT ARRAY[
        '夏目漱石の坊ちゃん', '人間失格', '雪国', 'こころ', '羅生門',
        '銀河鉄道の夜', '風の谷', '時をかける少女', '君の名は', '千と千尋',
        '龍の子太郎', '海辺のカフカ', '春の雪', '道', '細雪'
    ] as titles
),
japanese_authors AS (
    SELECT ARRAY[
        '夏目漱石', '太宰治', '川端康成', '芥川龍之介', '宮沢賢治',
        '村上春樹', '三島由紀夫', '谷崎潤一郎', '大江健三郎', '安部公房',
        '江戸川乱歩', '吉本ばなな', '森鴎外', '遠藤周作', '樋口一葉'
    ] as authors
),
japanese_descriptions AS (
    SELECT ARRAY[
        '高等学校の教師として赴任した主人公の物語',
        '人間として失格した男の告白的な物語',
        '温泉地を舞台にした芸者との恋愛小説',
        '明治時代を生きた教師と学生の心の物語',
        '平安時代末期の京都を舞台にした短編小説',
        '現代社会を舞台にした青春小説',
        '伝統と現代が交差する日本の物語',
        '不思議な出来事に巻き込まれる少年の冒険',
        '日本の四季を背景にした恋愛物語',
        '戦後日本を描いた歴史小説'
    ] as descriptions
)
INSERT INTO japanese_books (title, author, description)
SELECT
    titles[1 + floor(random() * array_length(titles, 1))::integer],
    authors[1 + floor(random() * array_length(authors, 1))::integer],
    descriptions[1 + floor(random() * array_length(descriptions, 1))::integer]
FROM
    generate_series(1, 100000), japanese_titles, japanese_authors, japanese_descriptions;

Let’s try to find a book with a given title using the LIKE operator:

explain analyze SELECT title, author FROM japanese_books WHERE title LIKE '%こころ%';
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on japanese_books  (cost=0.00..5296.06 rows=13054 width=26) (actual time=0.218..33.279 rows=13235 loops=1)
   Filter: (title ~~ '%こころ%'::text)
   Rows Removed by Filter: 186770
 Planning Time: 0.791 ms
 Execution Time: 33.737 ms
(5 rows)

Now let’s create a GIN index with the gin_bigm_ops operator:

CREATE INDEX idx_books_title_bigm ON japanese_books USING gin (title gin_bigm_ops);
CREATE INDEX

explain analyze SELECT title, author FROM japanese_books WHERE title LIKE '%こころ%';

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on japanese_books  (cost=145.17..3104.34 rows=13054 width=26) (actual time=3.309..11.485 rows=13235 loops=1)
   Recheck Cond: (title ~~ '%こころ%'::text)
   Heap Blocks: exact=2774
   ->  Bitmap Index Scan on idx_books_title_bigm  (cost=0.00..141.90 rows=13054 width=0) (actual time=2.681..2.681 rows=13235 loops=1)
         Index Cond: (title ~~ '%こころ%'::text)
 Planning Time: 1.680 ms
 Execution Time: 12.149 ms

The pg_bigm extension has some key limitations:

  • Index size can grow rapidly – Bigram indexes are usually larger than trigram indexes on ASCII data of similar length. They can be up to 5–10x bigger than the column data.
  • No support for phrase searchpg_bigm doesn’t support phrase queries or proximity search.
  • Limited language awareness – No stemming, stopwords, synonyms—bigrams are raw 2-character pairs.

Clean up

Complete the following steps to clean up the resources and data that you might have created in this post to follow along:

  1. Delete the sample data with the following code:
    DROP TABLE products_infor;
    DROP TABLE japanese_books;
  2. Delete your DB cluster or DB instance.

Conclusion

In this post, we discussed specialized text search and filtering indexing techniques supported in Aurora PostgreSQL and Amazon RDS for PostgreSQL through extensions. Bloom indexes provide space-efficient multi-column equality filtering, pg_trgm enables fuzzy text matching and similarity searches, and pg_bigm offers full-text search capabilities optimized for Asian languages.

Throughout this four-part series, we explored the full spectrum of indexing capabilities available in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL:

  • Part 1 covered B-tree indexes—the default and most widely used index type, ideal for equality and range queries on ordered data.
  • Part 2 expanded into GIN, GiST, HASH, and BRIN indexes, addressing use cases such as full-text search, geometric data, exact-match lookups, and large sequential datasets.
  • Part 3 introduced extension-based indexes including SP-GiST, btree_gin, and btree_gist, enabling efficient spatial searches, unified indexing across mixed column types, and exclusion constraints for high-update workloads.
  • Part 4 completed the series with Bloom, pg_trgm, and pg_bigm extensions for multi-column filtering, fuzzy matching, and multilingual text search.

Together, these indexing strategies provide a comprehensive toolkit for optimizing query performance across diverse workloads. When implementing any of these indexes, analyze the performance impact on your overall workload and test thoroughly before production deployment.

If you have similar use cases in your application or workload, you can refer to this series and implement similar solutions to improve query performance, reduce storage overhead, and enhance the search capabilities of your PostgreSQL databases. You can use Amazon Aurora Fast Database Cloning to create test environments for evaluating index strategies without affecting production systems.

Leave your thoughts or questions in the comments section.


About the authors

Rajkumar Raghuwanshi

Rajkumar Raghuwanshi

Rajkumar is a Delivery Consultant with AWS Professional Services based out of Pune, India. With extensive knowledge of relational databases and hands-on experience in homogenous and heterogenous database migrations, he helps customers migrate to the AWS Cloud, code conversion and optimize performance.

Sachin Khanna

Sachin Khanna

Sachin is a Lead Consultant specializing in artificial intelligence and machine learning (AI/ML) within the AWS Professional Services team. With a strong background in data management, generative AI, large language models, and machine learning, he brings extensive expertise to projects involving data, databases, and AI-driven solutions. His proficiency in cloud migration and cost optimization has enabled him to guide customers through successful cloud adoption journeys, delivering tailored solutions and strategic insights.