Migrate full-text search from SQL Server to Amazon Aurora PostgreSQL-compatible edition or Amazon RDS for PostgreSQL

In today’s data-driven world, the ability to search and retrieve information from vast datasets is crucial. Although some commercial and open source databases similar to PostgreSQL excel at handling structured data, PostgreSQL also offers a powerful tooling for searching unstructured or semi-structured data. PostgreSQL provides built-in full-text search (FTS), and other extensions like pg_trgm and pg_bigm for text searching. Traditional SQL queries using LIKE and ILIKE operators and regular expressions are excellent for exact textual matching and structured data retrieval. However, they may have limitations in searching large blocks of unstructured text, such as documents, articles, or even product descriptions.

There are various factors to take into account when migrating from a commercial database like SQL Server to an open source database like PostgreSQL. Migrating full-text search from SQL Server to Amazon Aurora PostgreSQL-Compatible Edition requires modifying queries and schema structure because the implementation of full-text search is different in both databases and AWS Schema Conversion Tool (AWS SCT) doesn’t automatically convert full-text search code.

SQL Server FTS is designed to search for specific words, phrases, or even word forms (called stemming) within unstructured or semi-structured text data. It enables fast searching, ranking, and indexing of textual content, making it an invaluable asset for applications that need to handle vast amounts of text-based information.

In this post, we show you how to migrate full-text search in Microsoft SQL Server to Amazon Aurora PostgreSQL using text searching data types tsvector and tsquery. We also show you how to implement FTS using pg_trgm and pg_bigm extensions.

Prerequisites

For this post, we use the AdventureWorks2019 sample database to show how to migrate FTS from SQL Server 2019 Standard Edition to PostgreSQL. The following are the high-level steps to implement FTS on a SQL Server database:

Enable full-text search for the database AdventureWorks2019:

USE [AdventureWorks2019] 
GO 
EXEC sp_fulltext_database 'enable' 
GO

Create a full-text catalog:
```
CREATE FULLTEXT CATALOG DescFTSCatalog; 
GO
```
A full-text catalog is a logical container for full-text indexes. It organizes the full-text index data and defines the language-specific word breakers and stemmers.

Define a full-text index on the columns containing text data that you want to search:

CREATE FULLTEXT INDEX 
ON [AdventureWorks2019].[Production].[ProductDescription]([Description]) 
KEY INDEX [PK_ProductDescription_ProductDescriptionID] 
ON DescFTSCatalog 
GO

Use AWS SCT and AWS Database Migration Service (AWS DMS) to convert and migrate the AdventureWorks2019 database from SQL Server to Amazon Aurora PostgreSQL.

Refer to Migrate SQL Server to Amazon Aurora PostgreSQL using best practices and lessons learned from the field for migration best practices. For this post, we migrate the ProductDescription table from SQL Server to Amazon Aurora PostgreSQL using AWS SCT and AWS DMS.

PostgreSQL has a few options for searching through text: exact search, pattern matching, regular expressions, and full-text search. In the following section, we show you how to implement FTS in PostgreSQL on the migrated database to achieve similar results.

Full-text search in PostgreSQL

The LIKE operator, ILIKE operator and regular expression are used in a query’s WHERE clause for string pattern-based matching searches using wildcard characters. Search results can be ranked based on how close they are with the search pattern, LIKE and ILIKE do not provide the ranking facility and ignore frequently used words like “the”, “is,” and so on. PostgreSQL built-in full-text search provides search functionality through tsvector and tsquery data types along with related functions, operators, and parameters.

The tsvector data type represents a preprocessed and transformed version of text data, optimized for efficient full-text searching. It stores information about the individual words or tokens in a text document, allowing for quick and accurate text search operations. You can use the to_tsvector function to convert text data (documents) to tsvector data, which is used for searching data efficiently.

The tsquery data type is composed of one or more lexemes, which are the basic units of text used for searching. Lexemes can be simple words or tokens, and they can also be specified in various forms, such as prefix or exact forms. By combining lexemes and operators, you can create complex search conditions to find the most relevant results for your text search queries. to_tsquery is used to convert the search text to lexemes, which can be used to search against tsvector columns and values. For example, if you use to_tsquery on the text “He is running in the park,” the lexemes generated are “he,” “run,” and “park.” Words like “is,” “in,” and “the” are removed because they’re stop words. The word “running” is stemmed to “run” and words are made lowercase.

In the following sections, we’ll demonstrate several use-cases for full-text search, and provide strategies for how you can migrate them from SQL Server to Amazon Aurora PostgreSQL-compatible and Amazon RDS for PostgreSQL.

CONTAINS predicate with AND operator

Simple FTS queries in SQL Server use the CONTAINS predicate. The CONTAINS predicate in Transact-SQL provides a versatile way to perform advanced FTS in SQL Server databases. It supports various search conditions, proximity searches, wildcards, and thesaurus features, allowing you to tailor your queries to match specific requirements.

In the following sample query, the CONTAINS predicate checks for the words “entry” and “level” in the Description column:

SELECT ProductDescriptionID,Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE CONTAINS([Description], 'entry & level');

You can rewrite the query in PostgreSQL using the to_tsvector and to_tsquery functions as follows, using the default built-in text search dictionary value pg_catalog.simple. This configuration is driven by the parameter default_text_search_config.

SQL Server

SELECT ProductDescriptionID, Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE CONTAINS([Description], 'entry & level');

PostgreSQL

SELECT productdescriptionid, description 
FROM production.productdescription 
WHERE to_tsvector('pg_catalog.simple',Description) @@ to_tsquery('pg_catalog.simple','entry & level');

CONTAINS predicate with OR operator

This is similar to the earlier use case using the CONTAINS predicate, except that the check is performed using the OR operator. In the following sample query, the predicate checks for “entry,” “level,” or both:

SELECT ProductDescriptionID,Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE CONTAINS([Description], 'entry | level');

You can rewrite the query in PostgreSQL using the to_tsvector and to_tsquery functions as follows and using the default built-in text search dictionary value pg_catalog.simple.

SQL Server

SELECT ProductDescriptionID, Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE CONTAINS([Description], 'entry | level');

PostgreSQL

SELECT productdescriptionid, description
FROM production.productdescription 
WHERE to_tsvector('pg_catalog.simple',Description) @@ to_tsquery('pg_catalog.simple','entry | level');

FREETEXT predicate

The FREETEXT predicate in Transact-SQL (T-SQL) is used to perform full-text searches in SQL Server databases. Unlike the CONTAINS function, which requires specific terms and conditions, FREETEXT allows more flexible and natural language-based searches.

In the following sample queries, FREETEXT checks for the words “entry” or “level” and their forms (using stemming) in the Description column:

SELECT ProductDescriptionID, Description 
FROM [AdventureWorks2019].[Production].[Product description] 
WHERE FREETEXT([Description], 'entry level');

You can rewrite the query in PostgreSQL using the to_tsvector and to_tsquery functions as follows using the configuration value pg_catalog.english. This configuration uses english_stem and a simple dictionary to convert tokens to lexemes. A lexeme, therefore, represents a normalized form of a word or token that can be indexed and used for search operations.

SQL Server

SELECT ProductDescriptionID, Description 
FROM [AdventureWorks2019].[Production].[ProductDescription] 
WHERE FREETEXT([Description], 'entry level');

PostgreSQL

SELECT productdescriptionid, description
FROM production.productdescription 
WHERE to_tsvector('pg_catalog.english',Description) @@ replace(plainto_tsquery('pg_catalog.english','entry level')::varchar, '&','|')::tsquery;

FREETEXTTABLE function with RANK

FTS in SQL Server can generate an optional score (or rank value) that shows the relevance of the data returned by a full-text query. This rank value is calculated on every row and can be used as an ordering criteria to sort the result set of a query by relevance. The rank values show only a relative order of relevance of the rows in the result set. The actual values are unimportant and typically differ each time you run the query. The rank value doesn’t hold any significance across queries.

In the following sample queries, FREETEXTTABLE checks for the words “entry” or “level” and their forms (using stemming) in the Description column and also gets the RANK information:

SELECT FT_TBL.[ProductDescriptionID],FT_TBL.[Description], KEY_TBL.[RANK] 
FROM [AdventureWorks2019].[Production].[ProductDescription] FT_TBL
     INNER JOIN FREETEXTTABLE([AdventureWorks2019].[Production].[ProductDescription], [Description], 'entry OR level',1033) AS KEY_TBL
      ON  FT_TBL.[ProductDescriptionID] =KEY_TBL.[KEY]
ORDER BY KEY_TBL.[RANK] DESC,FT_TBL.[ProductDescriptionID];

In PostgreSQL, the ts_rank function is used to calculate the relevance ranking of search results based on their match to a specific query. The ranking is computed using a numerical value that represents how closely a document matches the search terms in the query.

SQL Server

SELECT FT_TBL.[ProductDescriptionID],FT_TBL.[Description], KEY_TBL.[RANK] 
FROM [AdventureWorks2019].[Production].[ProductDescription] FT_TBL 
INNER JOIN FREETEXTTABLE([AdventureWorks2019].[Production].[ProductDescription]
                         ,[Description], 'entry OR level',1033) AS KEY_TBL 
ON FT_TBL.[ProductDescriptionID] = KEY_TBL.[KEY] 
ORDER BY KEY_TBL.[RANK] DESC,FT_TBL.[ProductDescriptionID];

PostgreSQL

SELECT p.productdescriptionid ,p.description, 
       ts_rank(to_tsvector('pg_catalog.english',p.Description), query) AS rank, 
       ts_headline('pg_catalog.english',p.Description,query) headline 
FROM production.productdescription p, to_tsquery('pg_catalog.english','entry | level') query 
WHERE query @@ to_tsvector('pg_catalog.english',p.Description) ORDER BY rank DESC, p.productdescriptionid;

The ts_headline function is used to generate a summarized version of a document’s text that highlights the most relevant portions matching a specific search query. This function is useful for creating search result snippets or headlines that provide context to users about why a certain document applies to their search. The following screenshot shows the output of the PostgreSQL query’s headline column generated using the ts_headline function.

CONTAINSTABLE and FORMSOF functions with RANK

The FORMSOF function in SQL Server is used to perform inflectional searches. Inflectional searches involve searching for various forms of a word, such as plurals, verb tenses, or related word forms. This can help you find relevant documents even if they contain variations of the search term, thereby improving the search accuracy.

In the following sample queries, CONTAINSTABLE checks for the word “gear” and their forms (using INFLECTIONAL) in the Description column and also gets the RANK information:

SELECT FT_TBL.[ProductDescriptionID],FT_TBL.[Description], KEY_TBL.[RANK] 
FROM [AdventureWorks2019].[Production].[ProductDescription] FT_TBL 
    INNER JOIN CONTAINSTABLE([AdventureWorks2019].[Production].[ProductDescription],
                             [Description], 'FORMSOF(INFLECTIONAL,''gear'')',1033) AS KEY_TBL 
ON FT_TBL.[ProductDescriptionID] = KEY_TBL.[KEY] 
ORDER BY KEY_TBL.[RANK] DESC,FT_TBL.[ProductDescriptionID];

In PostgreSQL queries, the phrases are first broken into words or tokens, and these words are normalized and stemmed to form base words (lexemes) using the pg_catalog.english FTS configuration. These lexemes are going to be the same for different forms (stemming) of a word. Therefore, this will automatically handle inflectional searches.

SQL Server

SELECT FT_TBL.[ProductDescriptionID],FT_TBL.[Description], KEY_TBL.[RANK] 
FROM [AdventureWorks2019].[Production].[ProductDescription] FT_TBL 
     INNER JOIN CONTAINSTABLE([AdventureWorks2019].[Production].[ProductDescription],
                  [Description], 'FORMSOF(INFLECTIONAL,''gear'')',1033) AS KEY_TBL 
ON FT_TBL.[ProductDescriptionID] = KEY_TBL.[KEY] 
ORDER BY KEY_TBL.[RANK] DESC,FT_TBL.[ProductDescriptionID];

PostgreSQL

SELECT p.productdescriptionid ,p.description, 
       ts_rank(to_tsvector('pg_catalog.english',p.Description), query) AS rank, 
       ts_headline('pg_catalog.english',p.Description,query) headline 
FROM production.productdescription p, to_tsquery('pg_catalog.english','gear') query 
WHERE query @@ to_tsvector('pg_catalog.english',p.Description) 
ORDER BY rank desc,p.productdescriptionid;

Improving query performance in PostgreSQL

For the sample PostgreSQL queries shown earlier, the to_tsvector function gets the tsvector values out of the Description column from the productdescription table. In the following sections, we show you various options to improve query performance.

Solution 1: Use a GIN index

The GIN (Generalized Inverted Index) index in PostgreSQL is a popular indexing method used to efficiently speed up the search for complex data types such as JSON and full-text search. The standard database index, a B-tree, is designed to work for testing equality, whereas GIN is designed for search patterns that work over nested or composite data structures to allow for more expressive search patterns. By indexing components of complex data types separately, GIN indexes enable faster queries on arrays, JSON data, and text search operations. This makes GIN indexes a valuable tool for improving the performance of queries involving complex data structures in PostgreSQL databases.

In this approach, you create an expression-based GIN index on the column of interest in the productdescription table.

Run the following command:
```
CREATE INDEX productdescription_gin_idx ON production.productdescription 
USING GIN (to_tsvector('pg_catalog.english', Description));
```
If the table has millions of rows, you can increase the maintenance_work_mem configuration parameter at session level to speed up the index creation time. maintenance_work_mem specifies the maximum amount of memory in MB to be used by maintenance operations like the creation of INDEX—by default(PostgreSQL), it’s 64 MB.

Run the following EXPLAIN ANALYZE query:

EXPLAIN ANALYZE 
SELECT * FROM production.productdescription 
WHERE to_tsvector('pg_catalog.english', Description) @@ to_tsquery('pg_catalog.english','entry | level') 
ORDER BY productdescriptionid DESC;

The output shows a bitmap index scan being performed on productdescription_gin_index, which improves the query performance. The following screenshot shows the explain plan before creating the index.

The following screenshot shows the explain plan after creating the index.

In this case, we can see a speedup in query performance when using the GIN index. While in general, using a GIN index for full-text search in PostgreSQL can help improve performance, you’ll need to be aware of other performance tradeoffs, including the time required to build an index and additional storage an index requires.

Solution 2: Use a stored generated column

In this approach, you create a computed column description_tsv that holds the tsvector value out of the description column from the table followed by a GIN index on the computed column.

Run the following commands:

ALTER TABLE production.productdescription 
ADD COLUMN description_tsv tsvector 
GENERATED ALWAYS AS (to_tsvector('pg_catalog.english',Description )) STORED; 
CREATE INDEX productdescription_gin_idx ON production.productdescription USING GIN(description_tsv);

Run the following sample EXPLAIN ANALYZE query:

EXPLAIN ANALYZE 
SELECT * 
FROM production.productdescription 
WHERE description_tsv @@ to_tsquery('pg_catalog.english','entry | level') 
ORDER BY productdescriptionid DESC;

The output shows that a bitmap index scan is being performed on productdescription_gin_index, which in this case demonstrated an improvement in query performance:

Full-text search in PostgreSQL using the pg_trgm extension

In PostgreSQL, the pg_trgm extension is implemented for text search functionalities utilizing trigrams. Trigrams are essentially sets of three consecutive characters extracted from a given string. By employing trigrams, users can identify resemblances or matches in text patterns within strings by comparing the number of trigrams that align between the strings, alongside the predefined similarity threshold parameter established prior to executing the search.

The pg_trgm extension provides operators that can be used to create trigram indexes on text columns in the table that need to be searched. This index allows for efficient similarity operations on the indexed columns. The extension provides the three similarity operations: similarity (%), word_similarity (<%), and strict_word_similarity (<<%). The threshold parameters for respective operations are pg_trgm.similarity_threshold, pg_trgm.word_similarity_threshold, and pg_trgm.strict_word_similarity_threshold, which can be set to a value between 0 (no similarity) and 1 (perfect match). The functions similarity(), word_similarty(), and strict_word_similarity() are used to calculate the similarity score. You can use pg_trgm as shown in following code:

Run the following command to create the pg_trgm extension:
CREATE EXTENSION pg_trgm;

Run the following command to create the GIN index on the productdescription column:

CREATE INDEX productdescription_trgm_idx ON production.productdescription USING GIN (Description gin_trgm_ops);

Run the following command to set the similarity_threshold configuration value to 0.2. similarity checks for common trigrams between two strings and returns a value between 0–1.

SET pg_trgm.similarity_threshold = 0.2; 
SET enable_seqscan = off; 
SELECT productdescriptionid, Description, similarity(Description, 'entry level') AS sml 
FROM production.product description 
WHERE Description % 'entry level' 
ORDER BY sml DESC, Description;

Run the following command to set the word_similarity_threshold configuration value to 0.6. word_similarity checks for the common trigrams between strings at word level.

SET pg_trgm.word_similarity_threshold = 0.6; 
SET enable_seqscan = off; 
SELECT productdescriptionid, Description, word_similarity('entry level', Description) AS sml 
FROM production.productdescription 
WHERE 'entry level' <% Description 
ORDER BY sml DESC, Description;

Run the following command to set the strict_word_similarity_threshold configuration value to 0.6. strict_word_similarity is like word_similarity but it considers common trigrams only when both words are identical.

SET pg_trgm.strict_word_similarity_threshold = 0.6; 
SET enable_seqscan = off; 
SELECT productdescriptionid, Description, strict_word_similarity('aluminum cups and hollow axle', Description) AS sml 
FROM production.productdescription 
WHERE 'entry level' <<% Description 
ORDER BY sml DESC, Description;

Run the following command to drop the index and enable sequential scan:
```
DROP INDEX production.productdescription_trgm_idx;
```

Full-text search in PostgreSQL using the pg_bigm extension

The pg_bigm extension in PostgreSQL enhances full-text search capabilities, particularly for languages with complex character sets like Asian languages.

A bigram is a group of two consecutive characters taken from a string. The extension uses a bigram indexing approach, which involves breaking text into pairs of consecutive characters and building an index based on these bigrams. The pg_bigm extension provides the bigm_similarity() function, bigm similarity operator = %, and pg_bigm.similarity_limit threshold parameter. You can use pg_bigm as follows:

Run the following command to create the pg_bigm extension. For instructions to create the extension in Amazon RDS for PostgreSQL, refer to Using PostgreSQL extensions with Amazon RDS for PostgreSQL.
```
CREATE EXTENSION pg_bigm;
```

Run the following command to create the GIN index on the productdescription column:

CREATE INDEX productdescription_bigm_idx ON production.productdescription USING gin (Description gin_bigm_ops);

Run the following command to set the similarity_limit configuration value to 0.15. similarity checks for common bigrams between two strings and returns a value between 0–1.

SET pg_bigm.similarity_limit TO 0.15; 
SELECT *,bigm_similarity(Description, '%entry level%') rank1 
FROM production.productdescription 
WHERE Description =% '%entry level%' 
ORDER BY rank1 DESC;

Run the following command drop the index and enable sequential scan:

DROP INDEX production.productdescription_bigm_idx;
SET enable_seqscan = on;

Conclusion

In this post, we showed you how to migrate FTS from SQL Server to PostgreSQL and compared a few common use cases. Migrating the full-text search from SQL Server to PostgreSQL requires a manual rewrite of code. For further reading, refer to the limitations of PostgreSQL’s text search features. We also showed you how to use the pg_trgm and pg_bigm extensions in PostgreSQL to implement FTS.

Share your thoughts with us in the comments.

AWS Database Blog

Migrate full-text search from SQL Server to Amazon Aurora PostgreSQL-compatible edition or Amazon RDS for PostgreSQL

Prerequisites

Full-text search in PostgreSQL

CONTAINS predicate with AND operator

CONTAINS predicate with OR operator

FREETEXT predicate

FREETEXTTABLE function with RANK

CONTAINSTABLE and FORMSOF functions with RANK

Improving query performance in PostgreSQL

Solution 1: Use a GIN index

Solution 2: Use a stored generated column

Full-text search in PostgreSQL using the pg_trgm extension

Full-text search in PostgreSQL using the pg_bigm extension

Conclusion

About the authors

Resources

Blog Topics

Follow

Learn

Resources

Developers

Help