AWS Database Blog
Working with accent-insensitive collations with Babelfish for Aurora PostgreSQL
Babelfish for Aurora PostgreSQL includes support for the SQL Server wire-protocol and T-SQL, which is the query language used in Microsoft SQL Server. This means that developers can use Babelfish to run their existing SQL Server applications on Amazon Aurora PostgreSQL-Compatible Edition without having to switch database drivers or completely rewrite their queries.
If you’re working with Latin languages or any other language with accents in your database, you may have a requirement for a CI_AI
collation. CI
stands for case-insensitive, which allows you to sort and compare text without regard to case, with regards to case, we also have CS
(case-sensitive) which will do the opposite of CI
, it differentiates upper and lower case when filtering, sorting and comparing texts, for accentuation we also have 2 variations AS
(case-sensitive) which differentiates words with and without accents and AI
(accent-insensitive), which allows you to sort and compare text data without regard to diacritics (which include accents and other glyphs added to a letter such as cedilla (ç), circumflex (ô), umlaut (ö), tilde (ñ), and more.). This can be especially useful if you’re working with languages like Portuguese, French, Spanish, or other languages which use accents on certain letters. A common case are names, when searching for a name in an application, a user may not type the correct name and ends up not finding the desired record, for example Joao or João, Jurgen or Jürgen.
Babelfish for Aurora PostgreSQL supports 35 collations from SQL Server which can be used as a server or an object collation. You can also run the query SELECT * FROM fn_helpcollations()
on Babelfish for the list of supported collations for your objects.
Note: Currently, you will find 141 collations supported in Babelfish by querying fn_helpcollations() , but note that not all of them can be used as the default server collation (the 35 collations which are shown in the dropdown menu in the RDS console when creating the instance), however, you can use these collations when creating your object or in your expressions. |
The default collation is sql_latin1_general_cp1_ci_as
. The _as
in this collation means is accent-sensitive, which means the database will distinguish between accented and non-accented characters, for example José
is not equal to Jose
. However, if your application has the requirement of using accent-insensitive (AI
) collations, currently Babelfish doesn’t support accent-insensitive as a default collation, you need to use an alternative.
The default collations used in T-SQL and PostgreSQL SQL are not identical, we will have different behavior when filtering or sorting records and this can lead to semantic differences.
In this post, we walk through how to use CI_AI
and CS_AI
collations on Babelfish for Aurora PostgreSQL, allowing you to keep the support for Latin-based languages or any other language with accents.
Pre-requisites and Limitations
You must meet the following prerequisites in order to use CI_AI collation with Babelfish:
- Babelfish for Aurora PostgreSQL DB Cluster version 2.1.2 (PG 14.3) or later
You may get an error message for the conflict collation if you are not explicitly using the COLLATE clause on a Babelfish lower than version 2.1.2 - SQL Server Management Studio (SSMS)
- Permission to connect to your Babelfish cluster from SSMS.
What are CI_AI / CS_AI collations?
CI_AI
collations are a type of collation that are designed to be both case-insensitive and accent-insensitive. This means that strings are treated as equal if they have the same characters, regardless of their case or accent marks.
CI_AS
collations, are case-insensitive, and accent-sensitive. This means that strings are treated equally for uppercase and lowercase but not for accent marks.
For example, consider the following two strings: José
and jose
. With CI_AS
or CS_AS
collation, these two strings would not be considered equal. However, with a CI_AI
collation, they would be considered equal, because the collation ignores case and accent marks, rendering both strings identical.
Why use CI_AI collations in Babelfish?
There are several reasons why you might consider using CI_AI
collations in your Babelfish projects:
- Improved data retrievals – By ignoring case and accent marks,
CI_AI
collations can help to ensure that your data is correctly sorted and compared, even in cases where the linguistic rules for sorting or comparing strings can be complex or inconsistent. - Better support for multilingual data retrievals –
CI_AI
collations are especially useful when working with multilingual data, because they allow you to sort and compare strings regardless of the case or accent marks within the linguistic difference between languages. - User-friendliness – With CI_AI collations, you can reduce the number of comparisons that need to be made, which can help improve the performance of your applications.
The following script will create a sample database and table to evaluate the behavior of a database with CI_AI
collation on SQL Server:
Note that if you run CREATE DATABASE [DB-SQL-test] COLLATE Latin1_General_CI_AI
on Babelfish, the command fails because the collation Latin1_General_CI_AI
is not supported as a default collation in your database.
The outcome of the SELECT query on SQL Server is as follows:
Note that even a query with an uppercase and non-accented filter, SQL Server returns all records referring to JOSE
, because we are using a dataset with collation CI and AI.
Let’s run another query on SQL Server:
The outcome of the SELECT query on SQL Server is as follows:
Note that the query returned all records, even using a capitalized and accented filter, SQL Server returns all records referring to José
, because the CI_AI
collation ignores case and accent marks, rendering all strings identical.
Now let’s run the same process on Babelfish, with the exception of the collate clause:
The following is the outcome of the query on Babelfish:
Babelfish returns just two rows, because Babelfish supports CI collation but doesn’t support AI collation on the database level.
Use CI_AI collation on Babelfish for Aurora PostgreSQL
An alternative to using CI_AI
collation is to indicate the right collation during the table creation. You can create the Babelfish instance using the default collation, but when creating the table, you indicate the right collation for the text columns:
The outcome of the query on Babelfish is as follows:
By specifying the collate clause in the text column during table creation, we now have the desired behavior: the query is returning all records, regardless of whether the “e” has an accent or not.
Test join and sort operations
Let’s test the behavior of our solution when sorting and joining tables with and without the collate clause:
The outcome of the query on Babelfish is as follows:
The join returned only two records, because we created tbPrice
without the CI_AI
clause, the join operator ignores the accent records. Now, let’s run the join with the tbProduct2
that is using the collate CI_AI
clause:
Note: If you are running Babelfish version lower than 2.1.2 (PG 14.3), you will get an error related to collation conflict:
“could not determine which collation to use for string comparison”
The following is the outcome of the query on Babelfish:
By using one of the tables that has CI_AI
collation, we can see all records (CI and AI). If we try to simulate the same behavior in SQL Server, we have a different result:
The outcome of the query on SQL Server is as follows:
Msg 468, Level 16, State 9, Line 25 Cannot resolve the collation conflict between "Latin1_General_CI_AI" and "SQL_Latin1_General_CP1_CI_AS" in the equal to operation.
In this scenario, the SQL Server can’t handle the collation conflict, which will require developers to use the collate
clause or change the table collation to avoid this issue, whereas Babelfish returns the record considering the CI_AI
.
Now let’s check how the sort operation differs between SQL Server and Babelfish. First, let’s observe the SQL Server behavior:
The following table shows the outcome of the query on SQL Server.
Ascending order | Descending order |
---|---|
In the SQL Server scenario, the result is the same regardless of the sorting mode (ASC
or DESC
). Now let’s run the same script in Babelfish:
The following table shows the outcome of the query on Babelfish.
Ascending order | Descending order |
---|---|
In the Babelfish scenario, the results are different: ASC
returned the unaccented records first and DESC
returned the opposite. Also note that both results are different from the SQL Server result. This is because the way PostgreSQL treats the sort operations. If you run the same query on tbProduct2
that is considering the collate clause, you also get a different result:
The following table shows the outcome of the query on Babelfish.
Ascending order | Descending order |
---|---|
In the Babelfish scenario, when querying and sorting a table that has a collate clause, the results are now identical to SQL Server.
Expliciting a collate clause in a T-SQL expression
If you don’t want to or can’t change the table schema, you can add the collate clause in your query to have the same behavior:
The outcome of the query on Babelfish is as follows:
You can also use the same approach to join tables:
The outcome of the query on Babelfish is as follows:
Let’s simulate the case of ordering results, run the following command on SQL Server:
The outcome of the query on SQL Server is as follows:
Now let’s run the same command on Babelfish: