AWS Database Blog

Working with accent-insensitive collations with Babelfish for Aurora PostgreSQL

Babelfish for Aurora PostgreSQL includes support for the SQL Server wire-protocol and T-SQL, which is the query language used in Microsoft SQL Server. This means that developers can use Babelfish to run their existing SQL Server applications on Amazon Aurora PostgreSQL-Compatible Edition without having to switch database drivers or completely rewrite their queries.

If you’re working with Latin languages or any other language with accents in your database, you may have a requirement for a CI_AI collation. CI stands for case-insensitive, which allows you to sort and compare text without regard to case, with regards to case, we also have CS (case-sensitive) which will do the opposite of CI, it differentiates upper and lower case when filtering, sorting and comparing texts, for accentuation we also have 2 variations AS (case-sensitive) which differentiates words with and without accents and AI (accent-insensitive), which allows you to sort and compare text data without regard to diacritics (which include accents and other glyphs added to a letter such as cedilla (ç), circumflex (ô), umlaut (ö), tilde (ñ), and more.). This can be especially useful if you’re working with languages like Portuguese, French, Spanish, or other languages which use accents on certain letters. A common case are names, when searching for a name in an application, a user may not type the correct name and ends up not finding the desired record, for example Joao or João, Jurgen or Jürgen.

Babelfish for Aurora PostgreSQL supports 35 collations from SQL Server which can be used as a server or an object collation. You can also run the query SELECT * FROM fn_helpcollations() on Babelfish for the list of supported collations for your objects.

Note: Currently, you will find 141 collations supported in Babelfish by querying fn_helpcollations(), but note that not all of them can be used as the default server collation (the 35 collations which are shown in the dropdown menu in the RDS console when creating the instance), however, you can use these collations when creating your object or in your expressions.

The default collation is sql_latin1_general_cp1_ci_as. The _as in this collation means is accent-sensitive, which means the database will distinguish between accented and non-accented characters, for example José is not equal to Jose. However, if your application has the requirement of using accent-insensitive (AI) collations, currently Babelfish doesn’t support accent-insensitive as a default collation, you need to use an alternative.

The default collations used in T-SQL and PostgreSQL SQL are not identical, we will have different behavior when filtering or sorting records and this can lead to semantic differences.

In this post, we walk through how to use CI_AI and CS_AI collations on Babelfish for Aurora PostgreSQL, allowing you to keep the support for Latin-based languages or any other language with accents.

Pre-requisites and Limitations

You must meet the following prerequisites in order to use CI_AI collation with Babelfish:

What are CI_AI / CS_AI collations?

CI_AI collations are a type of collation that are designed to be both case-insensitive and accent-insensitive. This means that strings are treated as equal if they have the same characters, regardless of their case or accent marks.

CI_AS collations, are case-insensitive, and accent-sensitive. This means that strings are treated equally for uppercase and lowercase but not for accent marks.

For example, consider the following two strings: José and jose. With CI_AS or CS_AS collation, these two strings would not be considered equal. However, with a CI_AI collation, they would be considered equal, because the collation ignores case and accent marks, rendering both strings identical.

Why use CI_AI collations in Babelfish?

There are several reasons why you might consider using CI_AI collations in your Babelfish projects:

  • Improved data retrievals – By ignoring case and accent marks, CI_AI collations can help to ensure that your data is correctly sorted and compared, even in cases where the linguistic rules for sorting or comparing strings can be complex or inconsistent.
  • Better support for multilingual data retrievalsCI_AI collations are especially useful when working with multilingual data, because they allow you to sort and compare strings regardless of the case or accent marks within the linguistic difference between languages.
  • User-friendliness – With CI_AI collations, you can reduce the number of comparisons that need to be made, which can help improve the performance of your applications.

The following script will create a sample database and table to evaluate the behavior of a database with CI_AI collation on SQL Server:

--- creating a sample database on SQL Server
CREATE DATABASE [DB-SQL-test] COLLATE Latin1_General_CI_AI
GO
USE [DB-SQL-test]
GO
--- creating test table
CREATE TABLE tbProduct (idProd int, Item nvarchar(50))
GO
--- insert CS and CI records 
INSERT INTO tbProduct VALUES (1,'José'),(2,'Jose'),(3,'josé'),(4,'jose')
GO
-- Query filtering with uppercase and non-accented text
SELECT * FROM tbProduct WHERE Item = 'JOSE'

Note that if you run CREATE DATABASE [DB-SQL-test] COLLATE Latin1_General_CI_AI on Babelfish, the command fails because the collation Latin1_General_CI_AI is not supported as a default collation in your database.

The outcome of the SELECT query on SQL Server is as follows:

idProd Item
------ ----------
1      José
2      Jose
3      josé
4      jose

(4 rows affected)

Note that even a query with an uppercase and non-accented filter, SQL Server returns all records referring to JOSE, because we are using a dataset with collation CI and AI.

Let’s run another query on SQL Server:

-- Query filtering with capitalized and accented text
USE [DB-SQL-test]
GO
SELECT * FROM tbProduct WHERE Item = 'José'

The outcome of the SELECT query on SQL Server is as follows:

idProd Item
------ ----------
1      José
2      Jose
3      josé
4      jose

(4 rows affected)

Note that the query returned all records, even using a capitalized and accented filter, SQL Server returns all records referring to José, because the CI_AI collation ignores case and accent marks, rendering all strings identical.

Now let’s run the same process on Babelfish, with the exception of the collate clause:

--- creating a sample database on Babelfish
CREATE DATABASE [DB-BBF-test] 
GO
USE [DB-BBF-test]
GO
--- creating test table
CREATE TABLE tbProduct (idProd int, Item nvarchar(50))
GO
--- insert CS and CI rows 
INSERT INTO tbProduct VALUES (1,'José'),(2,'Jose'),(3,'josé'),(4,'jose')
GO
-- Query filtering with uppercase and non-accented text
SELECT * FROM tbProduct WHERE Item = 'JOSE'

The following is the outcome of the query on Babelfish:

idProd Item
------ ----------
2      Jose
4      jose

(2 rows affected)

Babelfish returns just two rows, because Babelfish supports CI collation but doesn’t support AI collation on the database level.

Use CI_AI collation on Babelfish for Aurora PostgreSQL

An alternative to using CI_AI collation is to indicate the right collation during the table creation. You can create the Babelfish instance using the default collation, but when creating the table, you indicate the right collation for the text columns:

USE [DB-BBF-test]
GO
--- creating test table specifying collate clause
CREATE TABLE tbProduct2 (idProd int, Item nvarchar(50) collate sql_latin1_general_cp1_ci_ai)
GO
--- insert CS and CI records 
INSERT INTO tbProduct2 VALUES (1,'José'),(2,'Jose'),(3,'josé'),(4,'jose')
GO
-- Query filtering with uppercase and non-accented text
SELECT * FROM tbProduct2 WHERE Item = 'JOSE'

The outcome of the query on Babelfish is as follows:

idProd Item
------ ----------
1      José
2      Jose
3      josé
4      jose

(4 rows affected)

By specifying the collate clause in the text column during table creation, we now have the desired behavior: the query is returning all records, regardless of whether the “e” has an accent or not.

Test join and sort operations

Let’s test the behavior of our solution when sorting and joining tables with and without the collate clause:

USE [DB-BBF-test]
GO
--creating table to simulate join
create table tbprice (item varchar(20), price decimal(5,2))
go
insert into tbPrice values ('jose',10.30)
go
--joining tbProduct with tbPrice
SELECT a.*, b.price FROM tbProduct a INNER JOIN tbprice b ON a.Item = b.item
Go

The outcome of the query on Babelfish is as follows:

idProd Item price
------ ---- -----
2      Jose 10.30
4      jose 10.30

(2 rows affected)

The join returned only two records, because we created tbPrice without the CI_AI clause, the join operator ignores the accent records. Now, let’s run the join with the tbProduct2 that is using the collate CI_AI clause:

USE [DB-BBF-test]
GO
SELECT a.*, b.price FROM tbProduct2 a INNER JOIN tbprice b ON a.Item = b.item

Note: If you are running Babelfish version lower than 2.1.2 (PG 14.3), you will get an error related to collation conflict:

“could not determine which collation to use for string comparison”

The following is the outcome of the query on Babelfish:

idProd Item price
------ ---- -----
1      José 10.30
2      Jose 10.30
3      josé 10.30
4      jose 10.30

(4 rows affected)

By using one of the tables that has CI_AI collation, we can see all records (CI and AI). If we try to simulate the same behavior in SQL Server, we have a different result:

--Execute this command on SQL Server
use [DB-SQL-test]
go
CREATE TABLE #tbTemp (id int, item NVARCHAR(50) )
INSERT INTO #tbTemp VALUES (1,'José')
SELECT a.*, b.* FROM #tbTemp a INNER JOIN tbProduct b ON a.item = b.item

The outcome of the query on SQL Server is as follows:

Msg 468, Level 16, State 9, Line 25
Cannot resolve the collation conflict between "Latin1_General_CI_AI" and "SQL_Latin1_General_CP1_CI_AS" in the equal to operation.

In this scenario, the SQL Server can’t handle the collation conflict, which will require developers to use the collate clause or change the table collation to avoid this issue, whereas Babelfish returns the record considering the CI_AI.

Now let’s check how the sort operation differs between SQL Server and Babelfish. First, let’s observe the SQL Server behavior:

--Execute this command at SQL Server
use [DB-SQL-test]
go
--Ascending sort
SELECT * FROM tbProduct ORDER BY Item ASC
GO
-- Descending sort
SELECT * FROM tbProduct ORDER BY Item DESC

The following table shows the outcome of the query on SQL Server.

Ascending order Descending order
idProd Item
------ ----------
1      José
2      Jose
3      josé
4      jose

(4 rows affected)
idProd Item
------ ----------
1      José
2      Jose
3      josé
4      jose

(4 rows affected)

In the SQL Server scenario, the result is the same regardless of the sorting mode (ASC or DESC). Now let’s run the same script in Babelfish:

--Execute this command in Babelfish
use [DB-BBF-test]
go
-- Ascending sort - table w/o collate
SELECT * FROM tbProduct ORDER BY Item ASC
GO
-- Descending sort - table w/o collate
SELECT * FROM tbProduct ORDER BY Item DESC

The following table shows the outcome of the query on Babelfish.

Ascending order Descending order
idProd Item
------ ----------
2      Jose
4      jose
1      José
3      josé

(4 rows affected)
idProd Item
------ ----------
1      José
3      josé
2      Jose
4      jose

(4 rows affected)

In the Babelfish scenario, the results are different: ASC returned the unaccented records first and DESC returned the opposite. Also note that both results are different from the SQL Server result. This is because the way PostgreSQL treats the sort operations. If you run the same query on tbProduct2 that is considering the collate clause, you also get a different result:

--Execute this command in Babelfish
use [DB-BBF-test]
go
-- Ascending sort - table w/ collate
SELECT * FROM tbProduct2 ORDER BY Item ASC
GO
-- Descending sort - table w/ collate
SELECT * FROM tbProduct2 ORDER BY Item DESC

The following table shows the outcome of the query on Babelfish.

Ascending order Descending order
idProd Item
------ ----------
1      José
2      Jose
3      josé
4      jose

(4 rows affected)
idProd Item
------ ----------
1      José
2      Jose
3      josé
4      jose

(4 rows affected)

In the Babelfish scenario, when querying and sorting a table that has a collate clause, the results are now identical to SQL Server.

Expliciting a collate clause in a T-SQL expression

If you don’t want to or can’t change the table schema, you can add the collate clause in your query to have the same behavior:

use [DB-BBF-test]
go
-- Query filtering with uppercase and non-accented text
SELECT * FROM tbProduct WHERE Item = 'jose' collate sql_latin1_general_cp1_ci_ai

The outcome of the query on Babelfish is as follows:

idProd Item
------ ----------
1      José
2      Jose
3      josé
4      jose

(4 rows affected)

You can also use the same approach to join tables:

use [DB-BBF-test]
go
-- Query joining tables using collate clause
SELECT a.*, b.price FROM tbProduct a INNER JOIN tbprice b ON a.Item = b.item collate sql_latin1_general_cp1_ci_ai

The outcome of the query on Babelfish is as follows:

idProd Item  price
------ ----- -----
1      José  10.30
2      Jose  10.30
3      josé  10.30
4      jose  10.30

(4 rows affected)

Let’s simulate the case of ordering results, run the following command on SQL Server:

--Execute this command on SQL Server
USE [DB-SQL-test]
GO
--Creating a table and inserting some records
CREATE TABLE StudentList (FirstName nvarchar(50), LastName nvarchar(50))
GO
INSERT INTO StudentList VALUES ('Márcia','Oliveira'),('Marcia','García'),('Martha','Rivera'),('Mary','Major')
--Query to generate an ID ordered by First Name, LastName
SELECT FirstName, LastName,ROW_NUMBER() OVER(ORDER BY FirstName ASC) as 'GeneratedID' FROM StudentList

The outcome of the query on SQL Server is as follows:

FirstName  LastName  GeneratedID
---------- --------- ------------
Márcia Oliveira  1
Marcia García    2
Martha     Rivera    3
Mary       Major     4

(4 rows affected)

Now let’s run the same command on Babelfish:

--Execute this command in Babelfish
USE [DB-BBF-test]
GO
--Creating a table and inserting some records
CREATE TABLE StudentList (FirstName nvarchar(50), LastName nvarchar(50))
GO
INSERT INTO StudentList VALUES ('Márcia','Oliveira'),('Marcia','García'),('Martha','Rivera'),('Mary','Major')
--Query to generate an ID ordered by First Name, LastName
SELECT FirstName, LastName, ROW_NUMBER() OVER(ORDER BY FirstName ASC) as 'GeneratedID' FROM StudentList

The outcome of the query on Babelfish is as follows:

FirstName  LastName  GeneratedID
---------- --------- ------------
Marcia García    1
Márcia Oliveira  2
Martha     Rivera    3
Mary       Major     4

(4 rows affected)

Note that the order of records in Babelfish and SQL Server are not identical, Babelfish is considering the accent during the order while SQL is not considering it, one workaround to fix this behavior as mentioned in this post is to use the collate clause:

--Execute this command in Babelfish
USE [DB-BBF-test]
GO
--Fixing order with collate clause
SELECT FirstName, LastName, ROW_NUMBER() OVER(ORDER BY FirstName COLLATE sql_latin1_general_cp1_ci_ai ASC) as 'GeneratedID' FROM StudentList

The outcome of the query on Babelfish is as follows:

FirstName  LastName  GeneratedID
---------- --------- ------------
Márcia     Oliveira  1
Marcia     García    2
Martha     Rivera    3
Mary       Major     4

(4 rows affected)

Now we have the same results as SQL Server.

The same solution can be used on an ORDER BY clause. Run the following command on SQL Server:

--Execute this command on SQL Server
USE [DB-SQL-test]
GO
--Querying First and Last names ordered by First Name
SELECT FirstName, LastName FROM StudentList
ORDER BY FirstName ASC

The outcome of the query on SQL Server is as follows:

FirstName  LastName
---------- --------
Márcia     Oliveira
Marcia     García
Martha     Rivera
Mary       Major

(4 rows affected)

Now let’s run the same command on Babelfish:

--Execute this command in Babelfish
USE [DB-BBF-test]
GO
--Querying First and Last names ordered by First Name
SELECT FirstName, LastName FROM StudentList
ORDER BY FirstName ASC

The outcome of the query on Babelfish is as follows:

FirstName LastName
--------- --------
Marcia    García
Márcia    Oliveira
Martha    Rivera
Mary      Major

(4 rows affected)

As the previous example using row_number, note that the order of records in Babelfish and SQL Server are not identical, Babelfish is considering the accent during the order while SQL is not considering it, we can also use the collate clause as a workaround:

--Execute this command in Babelfish
USE [DB-BBF-test]
GO
--Fixing order with collate clause
SELECT FirstName, LastName FROM StudentList
ORDER BY FirstName COLLATE sql_latin1_general_cp1_ci_ai ASC

The outcome of the query on Babelfish is as follows:

FirstName LastName
--------- --------
Márcia    Oliveira
Marcia    García
Martha    Rivera
Mary      Major

(4 rows affected)

Now we have the same results as SQL Server.

Conclusion

In this post, I showed you the importance of the CI_AI collations when working with Latin-based languages or any other language with accents in Babelfish. By using a CI_AI collation during table creation or adding the collate clause in your query, you can sort and compare text data without regard to case or accents, making your queries more flexible, more compatible with the real world, and easier to write a query.

If you have any questions, comments, or suggestions, leave a comment.


About the Author

Marcelo FernandesMarcelo Fernandes is a Senior Database Architect within the AWS Professional Services team, bringing over 21 years of experience in the field of databases. Throughout his career, Marcelo has been dedicated to assisting customers with database-related challenges, including migration, design, and performance optimization.