AWS Big Data Blog

Case-insensitive collation support for string processing in Amazon Redshift

Amazon Redshift is a fast, fully managed, cloud-native data warehouse. Tens of thousands of customers have successfully migrated their workloads to Amazon Redshift. We hear from customers that they need case-insensitive collation for strings in Amazon Redshift in order to maintain the same functionality and meet their performance goals when they migrate their existing workloads from legacy, on-premises data warehouses like Teradata, Oracle, or IBM. With that goal in mind, Amazon Redshift provides the option to create databases and columns with either case-sensitive or case-insensitive collation.

In this post, we discuss how to use case-insensitive collation and how to override the default collation. We also walk through the process of migrating an existing Teradata database to Amazon Redshift using the native collation capability.

What is collation?

Collation is a set of rules that tells a database engine how to compare and sort CHAR and VARCHAR column data in SQL. A case-insensitive collation ignores the differences between uppercase and lowercase letters for string comparison and sorting, whereas a case-sensitive collation does not. For example, under case-insensitive collation, “A” and “a” are equal. Operations such as LIKE predicates, GROUP BY, ORDER BY, regular expressions, and SIMILAR TO behave according to the defined collation, while the stored data keeps its original case.
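As a quick illustration, the following hypothetical comparisons (a minimal sketch using the COLLATE function covered later in this post) show how the same two strings compare under each collation:

-- Under case-insensitive collation, 'A' and 'a' compare as equal
select case when collate('A', 'case_insensitive') = 'a'
            then 'equal' else 'not equal' end as case_insensitive_result;   -- 'equal'

-- Under case-sensitive collation, they don't
select case when collate('A', 'case_sensitive') = 'a'
            then 'equal' else 'not equal' end as case_sensitive_result;     -- 'not equal'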

We can define collation at three levels:

  • Database
  • Column
  • Expression

Database-level collation

You can specify collation when creating a new database. All VARCHAR and CHAR columns in the database pick up the database-level collation as the default if no column-level override exists. If you don’t specify a collation for a new database, the default remains case-sensitive.

To create a database with collation in Amazon Redshift, use the following syntax:

CREATE DATABASE database_name [ WITH ]
[ COLLATE CASE_SENSITIVE | COLLATE CASE_INSENSITIVE ]

To alter a database with collation in Amazon Redshift, use the following syntax:

ALTER DATABASE database_name
COLLATE CASE_SENSITIVE | COLLATE CASE_INSENSITIVE

This action only works when the database is empty, so it’s important to set the collation before you create any database objects.

To find the collation of an existing database, use the following query:

select db_collation();

This returns either case_sensitive or case_insensitive.

Column-level collation

You can specify collation for a VARCHAR or CHAR column when creating a new table. This overrides the database-level collation.

To define a table with column-level collation in Amazon Redshift, use the following syntax:

create table T1 (
col1 varchar(20) collate case_sensitive encode zstd,
col2 char(20) collate case_insensitive encode zstd,
col3 varchar(20) encode zstd /* not specified, use collation of current database */
);

To find the collation defined at the individual column level, use the following query:

SELECT table_name, column_name, data_type, collation_name
FROM svv_columns
WHERE table_name = 't1'
ORDER BY column_name;
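During a migration, it can also help to audit collation across a whole schema at once. The following is a sketch of such a query, assuming the table T1 above was created in the default public schema; adjust the schema filter for your environment:

-- List the collation of every column in a schema
SELECT table_name, column_name, data_type, collation_name
FROM svv_columns
WHERE table_schema = 'public'
ORDER BY table_name, column_name;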

Expression-level collation

You can use the collate function in a SQL query to override collation at the column level and database level.

To use the collate function in Amazon Redshift, use the following syntax. This example code converts col1 to case-insensitive and compares with the value john:

select * from T1 where collate(col1, 'case_insensitive') = 'john';

Solution overview

In the following use case, we discuss how to convert a legacy Teradata database’s collation syntax to Amazon Redshift syntax. In Teradata, the default CHAR and VARCHAR column collation changes based on the session mode. We can override this default collation at the column and expression level. In Amazon Redshift, we can define collation at the database, column, and expression levels.

In this use case, let’s consider the migration of two tables, invoice and customer, and the corresponding queries built using these two tables from the Teradata database ci_db. Make sure that ci_db is not an existing database in Amazon Redshift. We perform the following steps to walk through the migration process:

  1. Identify the collation in Teradata.
  2. Set up the Amazon Redshift database and schema.
  3. Set up the Amazon Redshift DDL.
  4. Load sample data.
  5. Test the reporting queries for collation syntax.

Identify the collation in Teradata

In Teradata, the default CHAR and VARCHAR column collation depends on the session mode. If the session is in ANSI mode, the default is case-sensitive; if it’s in Teradata mode, the default depends on the DefaultCaseSpec parameter at the cluster level. If DefaultCaseSpec is TRUE, the default is case-sensitive; if it’s FALSE, it’s case-insensitive. By default, DefaultCaseSpec is set to FALSE. We can override this default collation at the column and expression level. To check the session mode, run the following query:

SELECT transaction_mode FROM dbc.sessioninfo
WHERE sessionno = SESSION;

If the output is A, it’s in ANSI mode; if it’s T, it’s in Teradata mode. For this use case, let’s assume that the session is in Teradata mode.

To identify the collation override at the column level, run the following commands on the Teradata environment:

show table ci_db.invoice;
show table ci_db.customer;

We receive the following outputs:

CREATE MULTISET TABLE ci_db.invoice ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      invoice_id VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
      cust_nbr VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC
      )
PRIMARY INDEX ( invoice_id );


CREATE MULTISET TABLE ci_db.customer ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      cust_nbr VARCHAR(100) CHARACTER SET LATIN CASESPECIFIC,
      cust_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( cust_nbr );

Set up the Amazon Redshift database and schema

In Amazon Redshift, a database contains one or more named schemas. Each schema in a database contains tables and other kinds of named objects. By default, a database has a single schema, named PUBLIC. In general, we recommend creating each Teradata database as a schema in Amazon Redshift.

To achieve case insensitivity at the database level in Amazon Redshift, we create a new database, ci_database, with CASE_INSENSITIVE collation:

create database ci_database WITH COLLATE CASE_INSENSITIVE; 

After the database is created, connect to it by changing the database name in your connection details from the default database to ci_database. Then create the schema ci_db in the current database, ci_database.

The following code creates the schema in Amazon Redshift and sets it as the default schema for the session:

create schema ci_db;
set search_path to ci_db;

Set up Amazon Redshift DDL

Based on the Teradata DDL output from earlier, we can observe the following:

  • The ci_db.invoice table has all the fields as case-insensitive.
  • The ci_db.customer table has cust_nbr as case-sensitive and cust_name as case-insensitive.

Because we created a case-insensitive database in Amazon Redshift, we only need to specify a column-level collation override for the case-sensitive columns.

To attain similar collation logic in Amazon Redshift, create the tables with the following syntax:

create table ci_db.invoice 
(
invoice_id VARCHAR(100) ENCODE ZSTD
,cust_nbr VARCHAR(100) ENCODE ZSTD
)
DISTKEY ( invoice_id );



create table ci_db.customer 
(
cust_nbr VARCHAR(100) collate case_sensitive ENCODE ZSTD
,cust_name VARCHAR(100) ENCODE ZSTD 
)
DISTKEY ( cust_nbr );

Load sample data

In general, we recommend using the AWS SCT agent tool to migrate the data from Teradata to Amazon Redshift. For this post, let’s load the following sample data into the tables invoice and customer on Amazon Redshift. Run the following insert statements:

insert into ci_db.invoice (invoice_id ,cust_nbr) values ('inv1','a1');
insert into ci_db.invoice (invoice_id ,cust_nbr) values ('INV1','A1');
insert into ci_db.invoice (invoice_id ,cust_nbr) values ('inv2','b1');
insert into ci_db.invoice (invoice_id ,cust_nbr) values ('INV2','B1');

insert into ci_db.customer (cust_nbr,cust_name) values ( 'a1','John');
insert into ci_db.customer (cust_nbr,cust_name) values ( 'A1','David');
insert into ci_db.customer (cust_nbr,cust_name) values ( 'b1','Bob');
insert into ci_db.customer (cust_nbr,cust_name) values ( 'B1','Mary');

Test the reporting queries for collation syntax

Let’s review the data that we loaded into both tables:

select * from ci_db.invoice;

invoice_id | cust_nbr 
------------+----------
 inv1       | a1
 INV1       | A1
 inv2       | b1
 INV2       | B1
(4 rows)

select * from ci_db.customer;

cust_nbr | cust_name 
----------+-----------
 a1       | John 
 A1       | David
 b1       | Bob
 B1       | Mary
(4 rows)

Default collation column query

Run the following query on both environments and observe that both inv1 and INV1 invoices are returned because the invoice_id column has the default database collation (case-insensitive):

select * from ci_db.invoice where invoice_id ='inv1';

invoice_id | cust_nbr 
------------+----------
 inv1       | a1
 INV1       | A1
(2 rows)

Case-sensitive collation column query

Run the following query on both environments and observe that only the customer a1 is returned; the customer A1 is ignored because the cust_nbr field collation is case-sensitive:

select * from ci_db.customer where cust_nbr ='a1';

cust_nbr | cust_name 
----------+-----------
 a1       | John
(1 row)

Case-insensitive expression override collation query

Run the following query and observe that applying an expression-level override to case-insensitive on the case-sensitive column cust_nbr returns both customers a1 and A1.

The following is the Teradata syntax:

select * from ci_db.customer where cust_nbr (not casespecific) ='a1';

cust_nbr | cust_name 
----------+-----------
 a1       | John
 A1       | David
(2 rows)

The following is the Amazon Redshift syntax:

select * from ci_db.customer where collate(cust_nbr, 'case_insensitive') = 'a1';

cust_nbr | cust_name 
----------+-----------
 a1       | John
 A1       | David
(2 rows)

Expression-level override in join condition query

When columns with different collations are joined, Teradata performs a case-sensitive join by default. To achieve similar functionality in Amazon Redshift, we have to apply a case-sensitive expression-level override to the case-insensitive column. In the following example, the cust_nbr column is case-insensitive in the invoice table, whereas it’s case-sensitive in the customer table.

The following is the Teradata syntax:

select 
 inv.invoice_id
,inv.cust_nbr as cust_nbr_from_invoice
,cust.cust_nbr as cust_nbr_from_customer
,cust.cust_name  

from ci_db.invoice inv
inner join ci_db.customer cust 
on inv.cust_nbr = cust.cust_nbr
where cust.cust_nbr = 'a1'
order by inv.invoice_id;

invoice_id | cust_nbr_from_invoice | cust_nbr_from_customer | cust_name 
------------+-----------------------+------------------------+-----------
 inv1       | a1                    | a1                     | John
(1 row)

The following is the Amazon Redshift syntax:

select 
 inv.invoice_id
,inv.cust_nbr as cust_nbr_from_invoice
,cust.cust_nbr as cust_nbr_from_customer
,cust.cust_name  

from ci_db.invoice inv
inner join ci_db.customer cust 
on collate(inv.cust_nbr, 'case_sensitive') = cust.cust_nbr
where cust.cust_nbr = 'a1'
order by inv.invoice_id;

invoice_id | cust_nbr_from_invoice | cust_nbr_from_customer | cust_name 
------------+-----------------------+------------------------+-----------
 inv1       | a1                    | a1                     | John
(1 row)

Materialized views with column-level override

To speed up complex queries on large tables in Amazon Redshift, we can create materialized views to reduce the time it takes to compute the results. When we create materialized views on top of Amazon Redshift tables, the column-level collations defined on those tables are honored.

The following code creates a materialized view in Amazon Redshift:

create materialized view customer_mv AS
select collate(cust_nbr, 'case_insensitive') as cust_nbr_ci, cust_name from ci_db.customer;

Run the following query and observe that both customers a1 and A1 are returned, because the materialized view applies a case-insensitive override at the materialized view level even though the underlying base table column is case-sensitive:

SELECT * FROM customer_mv WHERE cust_nbr_ci ='a1';

cust_nbr_ci | cust_name 
------------+-----------
 a1         | John
 A1         | David
(2 rows)
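Note that if you later load additional rows into ci_db.customer, the materialized view needs to be refreshed (unless auto refresh is enabled) before the new rows appear in its results:

REFRESH MATERIALIZED VIEW customer_mv;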

Identify the column-level collations in Teradata

For Teradata to Amazon Redshift migrations, it’s important to identify the list of all columns with column-level override collation logic in Teradata. You can use the following query to identify the collation override at each column level:

select databasename,tablename,columnname,uppercaseflag from dbc.columns where databasename ='ci_db';

If UpperCaseFlag shows C, the column is case-sensitive; N means not case-sensitive.
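For example, to list only the columns that need an explicit case-sensitive override when the target Amazon Redshift database is case-insensitive, you can filter on that flag (a sketch based on the query above):

-- Columns that are case-specific in Teradata and therefore need
-- a column-level COLLATE CASE_SENSITIVE override in Amazon Redshift
select databasename, tablename, columnname
from dbc.columns
where databasename = 'ci_db'
and uppercaseflag = 'C';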

Things to consider

  • All string comparison operators, LIKE predicates, ORDER BY and GROUP BY clauses, aggregate functions, window functions, and scalar functions follow the database and column collation.
  • If a function or operation returns the VARCHAR or CHAR type and takes multiple inputs with different collation types (case-sensitive and case-insensitive), use an expression-level override (see the sketch after this list).
  • External queries, including Amazon Redshift Spectrum and Amazon Aurora PostgreSQL federated queries, use database-level collation only.
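For example, concatenating the case-insensitive invoice_id column with the case-sensitive cust_nbr column mixes collations, so one side needs an explicit override. The following is a sketch using the tables from this post; which side you override depends on the comparison semantics you need:

-- The || concatenation returns VARCHAR from inputs with different collations,
-- so we override cust_nbr to case-insensitive to match invoice_id
select inv.invoice_id || collate(cust.cust_nbr, 'case_insensitive') as invoice_cust_key
from ci_db.invoice inv
inner join ci_db.customer cust
on collate(inv.cust_nbr, 'case_sensitive') = cust.cust_nbr;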

For more details and limitations, see the Amazon Redshift Database Collation documentation.

We recommend using AWS SCT to accelerate your Teradata migration to Amazon Redshift. Refer to this blog for more details.

Conclusion

This post demonstrated how to use collation for string processing at the database, column, and expression level in Amazon Redshift. We also walked through migrating existing Teradata database collations to Amazon Redshift syntax.

The Amazon Redshift collation feature for string processing reduces the effort required to migrate from traditional on-premises MPP data warehouses such as Teradata to Amazon Redshift without refactoring your application code. It also helps you meet your performance goals on Amazon Redshift while preserving the case-insensitive behavior that was the default on premises.

We hope you can take advantage of this new Amazon Redshift feature to migrate to the AWS Cloud for database freedom.


About the authors

Mengchu Cai is a principal engineer at AWS. He works on Redshift query optimization, query performance, and SQL functionality challenges.

Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build data lakes and analytical applications on the AWS Cloud.

Yichao Xue is a Software Engineer with the Amazon Redshift Query Processing team. He enjoys solving challenging problems for different components of Redshift, including workload management, Redshift Spectrum, federated queries, and most recently case-insensitive collation. Outside work, he likes reading, watching movies, listening to music, and traveling around the world.