AWS Database Blog

Accelerate your data warehouse migration to Amazon Redshift – Part 1

In this post (the first in a multi-part series), we describe new capabilities to automate your schema conversion, preserve your investment in existing scripts, reports, and applications, accelerate query performance, and reduce your overall cost to migrate to Amazon Redshift.

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate AWS services like Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to take advantage of all of the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating from self-managed data warehouse engines to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.

Until now, migrating a data warehouse to AWS was complex and involved a significant amount of manual effort. You had to remediate syntax differences, inject code to replace proprietary features, and manually tune the performance of queries and reports.

Today, we’re happy to share some recent enhancements to Amazon Redshift and the AWS Schema Conversion Tool (AWS SCT) that make it easier to automate your migrations to Amazon Redshift. This is the first in a series of posts that introduce dozens of new features to Amazon Redshift and AWS SCT in the areas of scripting, data type support, performance, and SQL enhancements. We show examples of how to use the new features and provide links to relevant documentation so you can continue exploring these new capabilities.

In this post, we introduce five new features: automation for macro conversion, support for case-insensitive string collation, support for case-sensitive database identifiers, recursive common table expressions (WITH clauses), and automatic table optimization, which tunes your Amazon Redshift tables based on your query workload. We’ll use Teradata as an example data warehouse.

Let’s take a look at the highlights.

Macro conversion

Macros are a proprietary Teradata SQL extension. Essentially, macros are SQL statements that accept parameters and can be called from multiple entry points in your application code. You can think of macros as simple stored procedures. For example, consider the following Teradata table and macro:

CREATE MULTISET TABLE testschema.salaries(
  emp_id INTEGER FORMAT '-(10)9',
  amount DECIMAL(8,2) FORMAT '-------.99'
)
UNIQUE PRIMARY INDEX(emp_id);

The following macro updates the salaries of all employees. Because the UPDATE statement is encapsulated in a macro, it can be shared by multiple applications by simply calling the macro name:

REPLACE MACRO testschema.everybody_gets_a_raise (
  percentage DECIMAL(4,2)
) 
AS (
  UPDATE testschema.salaries 
  SET amount = amount * :percentage;
);

Because Amazon Redshift doesn’t natively support macros, you previously had to manually convert these statements into an Amazon Redshift equivalent. If you had hundreds or thousands of macros, this represented a significant migration cost.

We’re happy to share that AWS SCT can now automate this conversion for you. AWS SCT converts the macro into an Amazon Redshift stored procedure, and it converts any macro invocations into calls to the corresponding stored procedure.

For example, the preceding macro is converted to the following stored procedure in Amazon Redshift:

CREATE OR REPLACE PROCEDURE testschema.everybody_gets_a_raise(par_percentage NUMERIC(4,2))
AS $BODY$
BEGIN
    UPDATE testschema.salaries
    SET amount = amount * par_percentage;
END;
$BODY$
LANGUAGE plpgsql;

Let’s see what happens to invocations of the macro. Suppose a BTEQ script calls the macro as follows:

-- *******************************************
--  NAME:    everybody_gets_a_raise.sql
--  PURPOSE: Give all employees a 10% raise
-- *******************************************

.logon ${LOG_INFO};

exec testschema.everybody_gets_a_raise(10);
...

AWS SCT recognizes the macro invocation and converts it into a call to the Amazon Redshift stored procedure:

/* *******************************************
 NAME:    everybody_gets_a_raise.sql
 PURPOSE: Give all employees a 10% raise
 *******************************************
 */
\logon testschema ${LOG_INFO}

CALL testschema.everybody_gets_a_raise(10);
...

You can try out macro conversion in the latest version of AWS SCT, available now.

Case-insensitive collation

ANSI-compliant string comparison is case-sensitive; an uppercase “A” is different from a lowercase “a.” Normally, string comparisons respect the case of the operands, so that “A” = “a” is FALSE.

Some databases also support case-insensitive string comparison. Here, “A” = “a” is TRUE, as if both operands are converted to lowercase (or uppercase) for the purposes of the comparison. For example, in a Teradata database, case-insensitive collation is the default semantics for sessions running in BTET mode, which is the default session mode for the engine.

In contrast, Amazon Redshift uses the normal ANSI-compliant, case-sensitive comparison semantics by default.

Previously, when converting case-insensitive code, you had to inject extra code to convert string values to the same case before comparing them. This works in the sense that the case of the original strings is ignored in the comparison, but query performance suffers, especially when the string values appear in join or predicate clauses.
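
For example, a common workaround was to force both operands to the same case before comparing them, as in the following sketch (the orders and customers tables and their columns are hypothetical, for illustration only):

-- Folding both sides to uppercase makes the comparison case-insensitive,
-- but the UPPER() calls prevent efficient use of these columns in joins and filters.
SELECT o.order_id
FROM orders o
JOIN customers c
  ON UPPER(o.customer_name) = UPPER(c.customer_name)
WHERE UPPER(c.region) = 'NORTHEAST';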

Amazon Redshift now performs case-insensitive comparison natively as a feature of the database engine. With this new feature, you can enable case-insensitive collation when you define a new database, define a new column, or use a column in an expression.

Let’s look at how this works. First, create a new Amazon Redshift database with the CASE_INSENSITIVE collation. Any columns created in this database are case-insensitive by default:

testschema=# create database ci_database collate case_insensitive;
CREATE DATABASE

testschema=# \connect ci_database;
psql (11.0, server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
You are now connected to database "ci_database" as user "<user>".

Now, let’s create a table with three columns. The first column uses the database default collation, which is CASE_INSENSITIVE in this database, and the other two have their collation explicitly declared. We insert some data to test the functionality:

ci_database=# create table case_test (
  df_col varchar(20)
, cs_col varchar(20) collate case_sensitive
, ci_col varchar(20) collate case_insensitive
)
;
CREATE TABLE

ci_database=# insert into case_test values ('Hello', 'hello', 'HELLO');
INSERT 0 1

Let’s compare the default column and the explicitly declared case-insensitive column. Because the database default is CASE_INSENSITIVE, the comparison is case-insensitive and the strings match:

ci_database=# select 1 from case_test where df_col = ci_col;
 ?column? 
----------
 1
(1 row)

Similarly, you can override the case sensitivity of a column. In the following code, we override the case-sensitive column to be CASE_INSENSITIVE, and observe that the comparison matches again:

ci_database=# select 1 from case_test where df_col = collate(cs_col, 'case_insensitive');
 ?column? 
----------
 1
(1 row)

Lastly, we want to note that Amazon Redshift won’t let you directly compare a CASE_SENSITIVE column to a CASE_INSENSITIVE column (as in the following code).

ci_database=# select 1 from case_test where cs_col = ci_col;
ERROR: Query with different collations is not supported yet.
DETAIL:
-----------------------------------------------
error: Query with different collations is not supported yet.
code: 1020
context:
query: 0
location: parse_expr.c:613
process: padbmaster [pid=17580]

To avoid this error, make sure you explicitly override the collation of one or both operands. This is also a best practice for your SQL code: it’s easier to understand when collation is applied explicitly.
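
For example, the comparison that failed above succeeds if you explicitly override the collation of the case-sensitive operand (a minimal sketch using the case_test table from earlier):

-- Override cs_col to case-insensitive so both operands share the same collation
select 1 from case_test where collate(cs_col, 'case_insensitive') = ci_col;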

Now, let’s look at how AWS SCT supports your database conversions. Suppose you have Teradata tables with CASESPECIFIC and NOT CASESPECIFIC column specifications.
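
For example, a Teradata table might declare column-level case sensitivity as in the following sketch (the contacts table is hypothetical), and you would expect the converted Amazon Redshift table to carry the corresponding COLLATE clauses:

-- Hypothetical Teradata source table with column-level case sensitivity
CREATE MULTISET TABLE testschema.contacts (
  contact_id INTEGER,
  email      VARCHAR(100) NOT CASESPECIFIC,
  login_name VARCHAR(50) CASESPECIFIC
)
UNIQUE PRIMARY INDEX (contact_id);

-- A sketch of the expected Amazon Redshift conversion, with the
-- case sensitivity preserved as COLLATE clauses
CREATE TABLE testschema.contacts (
  contact_id INTEGER,
  email      VARCHAR(100) COLLATE CASE_INSENSITIVE,
  login_name VARCHAR(50) COLLATE CASE_SENSITIVE
);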

You can also use AWS SCT to convert any expressions that use CASESPECIFIC or NOT CASESPECIFIC overrides. For example, you can define a macro whose SELECT statement performs a case-insensitive filter, as follows:

CREATE MACRO case_test_macro (string_par CHAR(10)) AS (
  SELECT *
  FROM ci_database.case_test
  WHERE :string_par (NOT CASESPECIFIC) = ci_col;
);

When you use AWS SCT to convert the code to Amazon Redshift, it preserves the case insensitivity of the filter in the converted code:

CREATE OR REPLACE PROCEDURE ci_database.case_test_macro
(par_string_par CHARACTER VARYING(10), macro_out INOUT refcursor)
AS $BODY$
BEGIN
    OPEN macro_out FOR
    SELECT *
    FROM ci_database.case_test
    WHERE collate(par_string_par, 'case_insensitive') = ci_col;
END;
$BODY$
LANGUAGE plpgsql;

How do you make this work in AWS SCT? You simply navigate to the object in the source tree of SCT, choose it (right-click), and choose Convert schema. In this case, the object is a macro, and it’s found in the Macros subtree in the source pane.

After converting, the object appears in the target tree, ready to be applied on Amazon Redshift.

We’re happy to introduce case-insensitive collation in Amazon Redshift, and expect it to be a big hit with customers. Amazon Redshift lets you define case sensitivity at the database, column, and expression level, and you can use AWS SCT to automatically convert your case-insensitive SQL to run on Amazon Redshift.

Case-insensitive collation is available to use now in Amazon Redshift. To learn more, see CREATE DATABASE and CREATE TABLE in the Amazon Redshift Database Developer Guide.

Case-sensitive identifiers

By default, Amazon Redshift identifiers are case-insensitive. The database forces lowercasing of table names and other identifiers, even if uppercase letters are used.
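
For example, the following sketch shows the default behavior (the table name is hypothetical): because the identifiers aren’t quoted, they are folded to lowercase, so unquoted references in any case resolve to the same objects:

-- The table is created as testschema.mixedcasedemo with column somecol
create table testschema.MixedCaseDemo (SomeCol integer);

-- Both of these references resolve to testschema.mixedcasedemo and somecol
select somecol from testschema.mixedcasedemo;
select SOMECOL from TESTSCHEMA.MIXEDCASEDEMO;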

Many customers have asked us to support mixed-case identifiers, and we’re happy to announce support for this feature. You can now control whether identifiers are interpreted as mixed-case or not at both the cluster and session level.

Let’s enable case-sensitive identifiers at the session level:

testschema=# set enable_case_sensitive_identifier to on;
SET

Now, we create a table that has mixed-case identifiers, enclosed in quotes. The quotes are required to instruct Amazon Redshift to preserve the case of the identifiers:

testschema=# create table testschema."MixedCaseTable" ("MCcol" integer, "lccol" varchar(100), "uccol" varchar(100));
CREATE TABLE

Now any references to the table must use the identifier with the case as defined in the declaration. The reference must be quoted to ensure Amazon Redshift preserves the case of the identifier reference:

testschema=# select * from testschema."MixedCaseTable";
 MCcol | lccol | uccol 
-------+-------+-------
(0 rows)

Identifiers that aren’t enclosed in quotes are still folded to lowercase, which is the default behavior in Amazon Redshift. See the following code:

testschema=# select * from testschema.MixedCaseTable;
ERROR: relation "testschema.mixedcasetable" does not exist

testschema=# select mccol from testschema."MixedCaseTable";
ERROR: column "mccol" does not exist in MixedCaseTable

testschema=# select "MCcol" from testschema."MixedCaseTable";
 MCcol 
-------
(0 rows)

Case-sensitive identifiers are available now in Amazon Redshift. For more information, see Names and identifiers in the Amazon Redshift Database Developer Guide.

Recursive common table expressions

Common table expressions (CTEs) are a convenient way to encapsulate query logic in large, complex SQL statements. Syntactically, CTEs are defined using the WITH clause; the main query uses the CTE by referencing it in a FROM clause.
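
For example, the following sketch shows a simple (non-recursive) CTE that uses the testschema.salaries table from the macro example earlier in this post:

-- The WITH clause defines the CTE high_earners; the main query
-- references it in its FROM clause like an ordinary table.
WITH high_earners AS (
  SELECT emp_id, amount
  FROM testschema.salaries
  WHERE amount > 100000
)
SELECT COUNT(*) AS num_high_earners
FROM high_earners;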

Amazon Redshift now supports recursive CTEs. A recursive CTE is useful in querying hierarchical data, such as organization charts that show reporting relationships between employees and managers.

For example, consider the following Amazon Redshift table that contains employees and their managers:

CREATE TABLE IF NOT EXISTS testschema.employees(
  id INTEGER
, name CHARACTER VARYING(20) ENCODE ZSTD
, manager_id INTEGER ENCODE AZ64
)
DISTSTYLE KEY
DISTKEY
(id)
SORTKEY
(id);

myredshift=# select * from testschema.employees;

  id | name    | manager_id 
-----+---------+------------
 101 | Bob     | 100
 100 | Caitlin | 
 102 | Renee   | 101
 103 | John    | 101
 110 | Mark    | 201
 201 | Ana     | 104
 106 | Jeff    | 102
 111 | Phyllis | 103
 105 | David   | 103
 112 | Britney | 201
 104 | Chris   | 103
(11 rows)

Let’s write a query to show all the employees that are in John’s organization. We create a recursive CTE that finds all employees that either report to John or report to a manager in John’s chain of command. For simplicity, we cap the recursion after four levels of management:

CREATE OR REPLACE view testschema.johns_employees AS
WITH RECURSIVE johns_org (id, name, manager_id, level) AS ( 
  SELECT 
    id
  , name
  , manager_id 
  , 1 AS level
  FROM 
    testschema.employees
  WHERE 
    id = 103 
  AND name = 'John'
 UNION ALL
  SELECT 
    e.id
  , e.name
  , e.manager_id 
  , level + 1
  FROM 
    testschema.employees e
  , johns_org jo
  WHERE 
      e.manager_id = jo.id
  AND level < 4
 )
SELECT id, name, manager_id FROM johns_org;

The CTE has two parts. The first part retrieves John’s information. This is the “base case” of the recursion. The second part recursively adds the employees who report to some manager in John’s organization. The result set is as follows:

myredshift=# select * from testschema.johns_employees;

 id  | name     | manager_id
-----+----------+------------
 111 | Phyllis  | 103
 103 | John     | 101
 201 | Ana      | 104
 104 | Chris    | 103
 112 | Britney  | 201
 105 | David    | 103
 110 | Mark     | 201
(7 rows)
We’re also happy to share that AWS SCT automatically converts queries with recursive common table expressions. For example, if you create a view with the query for John’s organization, you can convert that view using AWS SCT, as in the following screenshot.

Recursive CTEs are available now in Amazon Redshift. For more information, see WITH clause in the Amazon Redshift Database Developer Guide. You can download the latest version of AWS SCT to try out the query conversion feature.

Automatic table optimization

Choosing the right distribution and sort keys for your Amazon Redshift tables improves the performance of your queries and reports. Previously, you either created tables with no keys and relied on recommendations from Amazon Redshift Advisor, or made assumptions about the right keys based on domain knowledge of the source workload.

Amazon Redshift now supports automatic table optimization (ATO). With ATO, Amazon Redshift monitors the workload running on the database and automatically selects the best distribution and sort keys for your tables. When new keys are identified, the database manages the process of reorganizing the tables seamlessly.

Amazon Redshift gives you a lot of flexibility in how you use ATO. You can enable ATO when you create your tables, so that Amazon Redshift monitors the workload and chooses the keys after a sufficient number of queries have run. Alternatively, you can specify initial distribution and sort keys for a table and let ATO monitor and reorganize it afterwards. Or you can forgo ATO entirely and keep the distribution and sort keys you specified when you created the table.

For example, you can create a table without specifying any initial distribution style or sort keys, as in the following code. In this case, Amazon Redshift reorganizes the table after a sufficient number of queries have run against the table.

testschema=# create table testschema.ato_1 (a integer) diststyle auto sortkey auto;
CREATE TABLE

Or you can specify an initial distribution style and sort keys and instruct Amazon Redshift to use ATO, as in the following code. Amazon Redshift creates the table with the given distribution style and sort keys and reorganizes the table after sufficient queries have been run.

testschema=# create table testschema.ato_2 (a integer) diststyle key distkey (a) sortkey (a);
CREATE TABLE
testschema=# alter table testschema.ato_2 alter diststyle auto;
ALTER TABLE
testschema=# alter table testschema.ato_2 alter sortkey auto;
ALTER TABLE

We’ve also enhanced AWS SCT to apply ATO settings when you convert your tables. AWS SCT supports all of the options we’ve described: you can choose whether to apply initial distribution and sort keys, and whether to enable ATO.

For example, you can select Use Redshift automatic table tuning to enable ATO and select None as the initial key selection strategy.

In this case, AWS SCT sets the distribution style and sort keys to AUTO.
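
The generated DDL then looks roughly like the following sketch (shown here for the employees table used earlier; the actual output depends on your source schema):

-- Both settings are left to ATO; Amazon Redshift chooses and adjusts the
-- keys as the query workload accumulates.
CREATE TABLE testschema.employees (
  id INTEGER,
  name CHARACTER VARYING(20),
  manager_id INTEGER
)
DISTSTYLE AUTO
SORTKEY AUTO;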

Or you can let AWS SCT choose an initial distribution style and sort keys and also enable ATO. In this case, the table is created with the distribution style and sort keys that AWS SCT determines, and is then altered so that ATO can change these settings after a sufficient query workload has run.

In this case, we chose Use metadata, ignore statistical information to instruct AWS SCT to select keys based on the presence of indexes or other access paths in the source database.

Lastly, if you’re confident in your key selection, you can disable ATO entirely by unselecting Use Redshift automatic table tuning and choosing one of the three optimization strategies provided by AWS SCT.

ATO and conversion support for data warehouse migrations are available now in Amazon Redshift and AWS SCT. For more information, see Working with Automatic table optimization and Optimizing tables in Amazon Redshift using Automatic Table Optimization.

Conclusion

We’re happy to share these new features with you. If you’re contemplating a data warehouse migration to Amazon Redshift, these capabilities will help automate your schema conversion, preserve your investment in existing reports, applications, and ETL, and accelerate query performance on your migrated data.

This post described a few of the dozens of new features we’re introducing to automate your data warehouse migrations to AWS. We will share more in upcoming posts. You’ll hear about additional SQL automation, a purpose-built scripting language for Amazon Redshift with BTEQ compatibility, and automated support for other data warehouse features.

Check back soon for the next post in the series. Until then, you can learn more about Amazon Redshift and the AWS Schema Conversion Tool. Happy migrating!


About the author

Michael Soo is a database engineer with the AWS DMS and AWS SCT team at Amazon Web Services.