Accelerate your data warehouse migration to Amazon Redshift – Part 1
In this post (the first in a multi-part series), we describe new capabilities to automate your schema conversion, preserve your investment in existing scripts, reports, and applications, accelerate query performance, and reduce your overall cost to migrate to Amazon Redshift.
|Check out all posts in this series:
Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate AWS services like Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to take advantage of all of the analytic capabilities in the AWS Cloud.
Many customers have asked for help migrating from self-managed data warehouse engines to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.
Until now, migrating a data warehouse to AWS was complex and involved a significant amount of manual effort. You had to remediate syntax differences, inject code to replace proprietary features, and manually tune the performance of queries and reports.
Today, we’re happy to share some recent enhancements to Amazon Redshift and the AWS Schema Conversion Tool (AWS SCT) that make it easier to automate your migrations to Amazon Redshift. This is the first in a series of posts that introduce dozens of new features to Amazon Redshift and AWS SCT in the areas of scripting, data type support, performance, and SQL enhancements. We show examples of how to use the new features and provide links to relevant documentation so you can continue exploring these new capabilities.
In this post, we introduce five new features: automation for macro conversion, support for case-insensitive string collation, support for case-sensitive database identifiers, recursive common table expressions (WITH clauses), and automatic table optimization, which tunes your Amazon Redshift tables based on your query workload. We’ll use Teradata as an example data warehouse.
Let’s take a look at the highlights.
Macros are a proprietary SQL extension. Essentially, macros are SQL statements that accept parameters and can be called from multiple entry points in your application code. You can think of macros as simple stored procedures. For example, consider the following Teradata table and macro:
The following macro updates the salaries of all employees. Because the UPDATE statement is encapsulated in a macro, it can be shared by multiple applications by simply calling the macro name:
Because Amazon Redshift doesn’t natively support macros, you previously had to manually convert these statements into an Amazon Redshift equivalent. If you had hundreds or thousands of macros, this represented a significant migration cost.
We’re happy to share that AWS SCT can now automate this conversion for you. AWS SCT will convert this macro into an Amazon Redshift stored procedure. It will also convert any corresponding macro invocations into calls to the corresponding stored procedure.
For example, the preceding macro is converted to the following stored procedure in Amazon Redshift:
Let’s see what happens to invocations of the macro. Suppose a BTEQ script calls the macro as follows:
AWS SCT recognizes the macro invocation and converts it into a call to the Amazon Redshift stored procedure:
You can try out macro conversion in the latest version of AWS SCT, available now.
ANSI-compliant string comparison is case-sensitive; an uppercase “A” is different from a lowercase “a.” Normally, string comparisons respect the case of the operands, so that “A” = “a” is
Some databases also support case-insensitive string comparison. Here, “A” = “a” is
TRUE, as if both operands are converted to lowercase (or uppercase) for the purposes of the comparison. For example, in a Teradata database, case-insensitive collation is the default semantics for sessions running in BTET mode, which is the default session mode for the engine.
In contrast, case-sensitive comparison is the default semantics in Amazon Redshift. Amazon Redshift uses the normal ANSI-compliant semantics by default.
Before, when converting case-insensitive code, you had to inject extra code to convert string values to the same case prior to comparing them. This works in the sense that the case of the original strings is ignored in the comparison, but query performance suffers, especially when the string values are part of join or predicate clauses.
Amazon Redshift now performs case-insensitive comparison natively as a feature of the database engine. With this new feature, you can enable case-insensitive collation when you define a new database, define a new column, or use a column in an expression.
Let’s look at how this works. First, create a new Amazon Redshift database and make it
CASE_INSENSITIVE. Any columns that are created in the following code are case-insensitive by default:
Now, let’s create a table with three columns. The first column has the default case sensitivity, which is
CASE_INSENSITIVE, and the other two have their case sensitivity explicitly declared. We insert some data to test the functionality:
Let’s compare the default column and the explicitly declared case-insensitive column. Because the database default is
CASE_INSENSITIVE, the comparison is case-insensitive and the strings match:
Similarly, you can override the case sensitivity of a column. In the following code, we override the case-sensitive column to be
CASE_INSENSITIVE, and observe that the comparison matches again:
Lastly, we want to note that Amazon Redshift won’t let you directly compare a
CASE_SENSITIVE column to a
CASE_INSENSITIVE column (as in the following code).
To avoid this, make sure you explicitly override the collation of one or both operands appropriately. This is a best practice for your SQL code—it will be easier to understand when collation is explicitly applied.
Now, let’s look at how AWS SCT can support to your database conversions. Suppose you have Teradata tables with
NOT CASESPECIFIC column specifications.
You can also use AWS SCT to convert any expressions that use
NOT CASESPECIFIC overrides. For example, you can run a SELECT statement that performs a case-insensitive filter as follows:
When you use AWS SCT to convert the code to Amazon Redshift, it preserves the case insensitivity of the filter in the converted code:
How do you make this work in AWS SCT? You simply navigate to the object in the source tree of SCT, choose it (right-click), and choose Convert schema. In this case, the object is a macro, and it’s found in the Macros subtree in the source pane.
After converting, the object appears in the target tree, ready to be applied on Amazon Redshift.
We’re happy to introduce case-sensitive collation in Amazon Redshift, and expect it to be a big hit with customers. Amazon Redshift lets you define case sensitivity at the database, column, and expression level, and you can use AWS SCT to automatically convert case-insensitive SQL from Amazon Redshift.
By default, Amazon Redshift identifiers are case-insensitive. The database forces lowercasing of table names and other identifiers, even if uppercase letters are used.
Many customers have asked us to support mixed-case identifiers, and we’re happy to announce support for this feature. You can now control whether identifiers are interpreted as mixed-case or not at both the cluster and session level.
Let’s enable case-sensitive identifiers at the session level:
Now, we create a table that has mixed case identifiers, which are enclosed in quotes. This is a requirement to instruct Amazon Redshift to preserve the case of the identifiers.
Now any references to the table must use the identifier with the case as defined in the declaration. The reference must be quoted to ensure Amazon Redshift preserves the case of the identifier reference:
Identifiers that aren’t enclosed in quotes are interpreted as not case-sensitive. This is the default behavior in Amazon Redshift. See the following code:
Case-sensitive identifiers are available now in Amazon Redshift. For more information, see Names and identifiers in the Amazon Redshift developers guide.
Recursive common table expressions
Common table expressions (CTEs) are a convenient way to encapsulate query logic in large, complex SQL statements. Syntactically, CTEs are defined using the WITH clause; the main query uses the CTE by referencing it in a FROM clause.
Amazon Redshift now supports recursive CTEs. A recursive CTE is useful in querying hierarchical data, such as organization charts that show reporting relationships between employees and managers.
For example, consider the following Amazon Redshift table that contains employees and their managers:
Let’s write a query to show all the employees that are in John’s organization. We create a recursive CTE that finds all employees that either report to John or report to a manager in John’s chain of command. For simplicity, we cap the recursion after four levels of management:
The CTE has two parts. The first part retrieves John’s information. This is the “base case” of the recursion. The second part recursively adds the employees who report to some manager in John’s organization. The result set is as follows:
We’re also happy to share that AWS SCT automatically converts queries with recursive common table expressions. For example, if you create a view with the query for John’s organization, you can convert that view using AWS SCT, as in the following screenshot.
Recursive CTEs are available now in Amazon Redshift. For more information, see WITH clause in the Amazon Redshift Developer Guide. You can download the latest version of AWS SCT to try out the query conversion feature.
Automatic table optimization
Choosing the right distribution and sort keys for your Amazon Redshift tables improves the performance of your queries and reports. Previously, this was done by creating tables with no keys and relying on recommendations from Amazon Redshift Advisor, or by making assumptions on what might be the right keys, based on domain knowledge of the source workload.
Amazon Redshift now supports automatic table optimization (ATO). With ATO, Amazon Redshift monitors the workload running on the database and automatically selects the best distribution and sort keys for your tables. When new keys are identified, the database manages the process of reorganizing the tables seamlessly.
Amazon Redshift gives you a lot of flexibility in how you can use ATO. You can choose ATO when you create your tables, so that Amazon Redshift monitors then chooses the keys after a sufficient number of queries have been run. Or you can specify the initial distribution and sort keys to use for a given table and have ATO monitor and reorganize afterwards. Or you can choose not to use ATO at all and stick with the initial distribution and sort keys you specify when the table was created.
For example, you can create a table without specifying any initial distribution style or sort keys, as in the following code. In this case, Amazon Redshift reorganizes the table after a sufficient number of queries have run against the table.
Or you can specify an initial distribution style and sort keys and instruct Amazon Redshift to use ATO, as in the following code. Amazon Redshift creates the table with the given distribution style and sort keys and reorganizes the table after sufficient queries have been run.
We’ve also enhanced AWS SCT to apply ATO settings when you convert your tables. SCT supports all of the options we’ve described. You can choose to apply initial distribution and sort keys, or not, or choose to apply ATO or not.
For example, you can select Use Redshift automatic table tuning to enable ATO and select None as the initial key selection strategy.
In this case, AWS SCT sets the distribution style and sort keys to AUTO.
Or you can let AWS SCT choose an initial distribution style and sort keys and also enable ATO. In this case, the table is created with a distribution style and sort keys that AWS SCT determines, and the table is altered to enable ATO to change these settings after a sufficient query workload has passed.
In this case, we chose Use metadata, ignore statistical information to instruct AWS SCT to select keys based on the presence of indexes or other access paths in the source database.
Lastly, if you’re confident in your key selection, you can disable ATO entirely by unselecting Use Redshift automatic table tuning and choosing one of the three optimization strategies provided by AWS SCT.
ATO and conversion support for data warehouse migrations are available now in Amazon Redshift and AWS SCT. For more information, see Working with Automatic table optimization and Optimizing tables in Amazon Redshift using Automatic Table Optimization.
We’re happy to share these new features with you. If you’re contemplating a data warehouse migration to Amazon Redshift, these capabilities will help automate your schema conversion, preserve your investment in existing reports, applications, and ETL, and accelerate query performance on your migrated data.
This post described a few of the dozens of new features we’re introducing to automate your data warehouse migrations to AWS. We will share more in upcoming posts. You’ll hear about additional SQL automation, a purpose-built scripting language for Amazon Redshift with BTEQ compatibility, and automated support for other data warehouse features.
About the author
Michael Soo is a database engineer with the AWS DMS and AWS SCT team at Amazon Web Services.