Case-insensitive collation support for string processing in Amazon Redshift
Amazon Redshift is a fast, fully managed, cloud-native data warehouse. Tens of thousands of customers have successfully migrated their workloads to Amazon Redshift. We hear from customers that they need case-insensitive collation for strings in Amazon Redshift in order to maintain the same functionality and meet their performance goals when they migrate their existing workloads from legacy, on-premises data warehouses like Teradata, Oracle, or IBM. With that goal in mind, AWS provides an option to create case-insensitive and case-sensitive collation.
In this post, we discuss how to use case-insensitive collation and how to override the default collation. Also, we specifically explain the process to migrate your existing Teradata database using the native Amazon Redshift collation capability.
What is collation?
Collation is a set of rules that tells a database engine how to compare and sort the CHAR and VARCHAR columns data in SQL. A case-insensitive collation ignores the differences between uppercase and lowercase letters for string comparison and sorting, whereas a case-sensitive collation does not. For example, in case-insensitive collation, “A” and “a” are equal. Different operations such as LIKE predicates, group by, order by, Regex, similar to behave based on the collation defined while the stored data keeps its original case.
We can define collation at three levels:
You can specify collation when creating a new database. All VARCHAR and CHAR columns in the current database pick up the database-level collation as default if no column-level override exists. If you don’t specify collation for a new database, the default collation continues to be the current semantic of case-sensitive.
To create a database with collation in Amazon Redshift, use the following syntax:
To alter a database with collation in Amazon Redshift, use the following syntax:
This action only works when the database is empty. Therefore, it’s important to use collation as a first step before any database objects are created.
To find the collation of an existing database, use the following query.
This provides output as
You can specify collation for a VARCHAR or CHAR column when creating a new table. This overrides the database-level collation.
To define a table with column-level collation in Amazon Redshift, use the following syntax:
To find the collation defined at the individual column level, use the following query:
You can use the collate function in a SQL query to override collation at the column level and database level.
To use the collate function in Amazon Redshift, use the following syntax. This example code converts
col1 to case-insensitive and compares with the value
In the following use case, we discuss how to convert a legacy Teradata database’s collation syntax to Amazon Redshift syntax. In Teradata, based on the session mode, the default CHAR and VARCHAR columns collation changes. We can override this default collation at the column and expression level. In Amazon Redshift, we can define collation at the database, column, and expression levels.
In this use case, let’s consider the migration of two tables,
customer, and the corresponding queries built using these two tables from the Teradata database
ci_db. Make sure that
ci_db is not an existing database in Amazon Redshift. We perform the following steps to walk through the migration process:
- Identify the collation in Teradata.
- Set up the Amazon Redshift database and schema.
- Set up the Amazon Redshift DDL.
- Load sample data.
- Test the Reporting queries for collation syntax.
Identify the collation in Teradata
In Teradata, based on the session mode, the default CHAR and VARCHAR column collation changes. If the Teradata session mode is in ANSI mode, the default is case-sensitive; if it’s in Teradata mode, it’s dependent on the
DefaultCaseSpec parameter at cluster level. If
DefaultCaseSpec parameter is TRUE, the default is case sensitive; if it’s FALSE, it’s case insensitive. By default,
DefaultCaseSpec parameter is set to FALSE. We can override this default collation at the column and expression level.
If the output is
A, it’s in ANSI mode; if its
T, it’s in Teradata mode. For this use case, let’s assume that the session is in Teradata mode.
To identify the collation override at the column level, run the following commands on the Teradata environment:
We receive the following outputs:
Set up the Amazon Redshift database and schema
In Amazon Redshift a database contains one or more named schemas. Each schema in a database contains tables and other kinds of named objects. By default, a database has a single schema, which is named PUBLIC. In general, we recommend to create each Teradata database as a schema in Amazon Redshift.
To create similar functionality of case insensitivity at the database level in Amazon Redshift, we create a new database
After the database is created in Amazon Redshift, connect to the database by altering the database name from the default database to
ci_database in the connection details. Let’s create the schema
ci_db in the current database
The following code creates the schema in Amazon Redshift and sets it as the default schema for the session:
Set up Amazon Redshift DDL
Based on the Teradata DDL output from earlier, we can observe the following:
ci_db.invoicetable has all the fields as case-insensitive.
cust_nbras case-sensitive and
Because we created the case-insensitive database in Amazon Redshift, we need to mention the collation override at the column level for the case-sensitive columns only.
To attain similar collation logic in Amazon Redshift, create the tables with the following syntax:
Load sample data
In general, we recommend using the AWS SCT agent tool to migrate the data from Teradata to Amazon Redshift. For this post, let’s load the following sample data into the tables
customer on Amazon Redshift. Run the following insert statements:
Test the Reporting queries for collation syntax
Let’s review the data that we loaded into both the tables:
Default collation column query
Run the following query on both environments and observe that both
INV1 invoices are returned because the
invoice_id column has the default database collation (case-insensitive):
Case-sensitive collation column query
Run the following query on both environments and observe that only the customer
a1 is returned; the customer
A1 is ignored because the
cust_nbr field collation is case-sensitive:
Case-insensitive expression override collation query
Run the following query and observe that by performing an expression-level override to case-insensitive for the case-sensitive column
cust_nbr it returns both customers
The following is the Teradata syntax:
The following is the Amazon Redshift syntax:
Expression-level override in join condition query
When different collation columns are joined, in Teradata by default it performs a case-sensitive join. To achieve similar functionality in Amazon Redshift, we have to perform a case-sensitive expression-level override for the case-insensitive column. In the following example, the
cust_nbr column is case-insensitive in the invoice table, whereas it’s case-sensitive in the customer table.
The following is the Teradata syntax:
The following is the Amazon Redshift syntax:
Materialized views with column-level override:
To perform complex queries on large tables in Amazon Redshift, we can create materialized views to reduce the time it takes to compute the results. We can create materialized views on top of the Amazon Redshift tables and the column-level collations defined are honored.
The following code creates a materialized view in Amazon Redshift:
Run the following query and observe that both customers
A1 are returned because the materialized view has the case-insensitive override at the materialized view level even though in the base table it’s a case-sensitive override:
Identify the column-level collations in Teradata
For Teradata to Amazon Redshift migrations, it’s important to identify the list of all columns with column-level override collation logic in Teradata. You can use the following query to identify the collation override at each column level:
Uppercaseflag shows as
C, it’s a case-sensitive column;
N means not case-sensitive.
Things to consider
- All string comparison operators, like, order by and group by clauses, aggregate functions, window functions, and scalar functions, follow the database and column collation
- If a function or an operation returns the VARCHAR or CHAR type and takes multiple inputs with different collation types (case-sensitive and case-insensitive), you should use the expression-level override
- For external queries, including Amazon Redshift Spectrum and Amazon Aurora PostgreSQL, federated queries use database-level collation only.
For other details and limitations, see Amazon Redshift Database Collation documentation.
We recommend using AWS SCT to accelerate your Teradata migration to Amazon Redshift. Refer to this blog for more details.
This post demonstrated how to use collation for string processing at the database, column, and expression level in Amazon Redshift. We also walked through migrating existing Teradata database collations to Amazon Redshift syntax.
The Amazon Redshift collation feature for string processing reduces the effort required when migrating from traditional on-premises MPP data warehouses such as Teradata to Amazon Redshift without refactoring your application code. This feature also helps you achieve your performance goals using Amazon Redshift by keeping the on-premises default case-insensitive feature.
We hope you can take advantage of this new Amazon Redshift feature to migrate to the AWS Cloud for database freedom.
About the authors
Mengchu Cai is a principal engineer at AWS. He works on redshift query optimization, query performance and SQL functionality challenges.
Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build data lakes and analytical applications on the AWS Cloud.
Yichao Xue is a Software Engineer with Amazon Redshift Query Processing team. He enjoys solving challenging problems for different components of Redshift, including workload management, Redshift Spectrum, federated queries, and recently case-insensitive collation. Outside work, he likes reading, watching movies, listening to music and traveling around the world.