AWS Big Data Blog
Accelerate your data warehouse migration to Amazon Redshift – Part 3
This is the third post in a multi-part series. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and reduce your overall cost to migrate to Amazon Redshift.
Check out all the posts in this series:
|
Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other services such as Amazon EMR, Amazon Athena, and Amazon SageMaker to use all the analytic capabilities in the AWS Cloud.
Many customers have asked for help migrating from self-managed data warehouse engines, like Teradata, to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of historical data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.
Until now, migrating a Teradata data warehouse to AWS was complex and involved a significant amount of manual effort.
Today, we’re happy to share recent enhancements to Amazon Redshift and the AWS Schema Conversion Tool (AWS SCT) that make it easier to automate your Teradata to Amazon Redshift migrations.
In this post, we introduce new automation for merge statements, a native function to support ASCII character conversion, enhanced error checking for string to date conversion, enhanced support for Teradata cursors and identity columns, automation for ANY and SOME predicates, automation for RESET WHEN clauses, automation for two proprietary Teradata functions (TD_NORMALIZE_OVERLAP and TD_UNPIVOT), and automation to support analytic functions (QUANTILE and QUALIFY).
Merge statement
Like its name implies, the merge statement takes an input set and merges it into a target table. If an input row already exists in the target table (a row in the target table has the same primary key value), then the target row is updated. If there is no matching target row, the input row is inserted into the table.
Until now, if you used merge statements in your workload, you were forced to manually rewrite the merge statement to run on Amazon Redshift. Now, we’re happy to share that AWS SCT automates this conversion for you. AWS SCT decomposes a merge statement into an update on existing records followed by an insert for new records.
Let’s look at an example. We create two tables in Teradata: a target table, employee
, and a delta table, employee_delta
, where we stage the input rows:
Now we create a Teradata merge statement that updates a row if it exists in the target, otherwise it inserts the new row. We embed this merge statement into a macro so we can show you the conversion process later.
Now we use AWS SCT to convert the macro. (See Accelerate your data warehouse migration to Amazon Redshift – Part 1 for details on macro conversion.) AWS SCT creates a stored procedure that contains an update (to implement the WHEN MATCHED condition) and an insert (to implement the WHEN NOT MATCHED condition).
This example showed how to use merge automation for macros, but you can convert merge statements in any application context: stored procedures, BTEQ scripts, Java code, and more. Download the latest version of AWS SCT and try it out.
ASCII() function
The ASCII function takes as input a string and returns the ASCII code, or more precisely, the UNICODE code point, of the first character in the string. Previously, Amazon Redshift supported ASCII as a leader-node only function, which prevented its use with user-defined tables.
We’re happy to share that the ASCII function is now available on Amazon Redshift compute nodes and can be used with user-defined tables. In the following code, we create a table with some string data:
Now you can use the ASCII function on the string columns:
Lastly, if your application code uses the ASCII function, AWS SCT automatically converts any such function calls to Amazon Redshift.
The ASCII feature is available now—try it out in your own cluster.
TO_DATE() function
The TO_DATE function converts a character string into a DATE value. A quirk of this function is that it can accept a string value that isn’t a valid date and translate it into a valid date.
For example, consider the string 2021-06-31
. This isn’t a valid date because the month of June has only 30 days. However, the TO_DATE function accepts this string and returns the “31st” day of June (July 1):
Customers have asked for strict input checking for TO_DATE, and we’re happy to share this new capability. Now, you can include a Boolean value in the function call that turns on strict checking:
You can turn off strict checking explicitly as well:
Also, the Boolean value is optional. If you don’t include it, strict checking is turned off, and you see the same behavior as before the feature was launched.
You can learn more about the TO_DATE function and try out strict date checking in Amazon Redshift now.
CURSOR result sets
A cursor is a programming language construct that applications use to manipulate a result set one row at a time. Cursors are more relevant for OLTP applications, but some legacy applications built on data warehouses also use them.
Teradata provides a diverse set of cursor configurations. Amazon Redshift supports a more streamlined set of cursor features.
Based on customer feedback, we’ve added automation to support Teradata WITH RETURN cursors. These types of cursors are opened within stored procedures and returned to the caller for processing of the result set. AWS SCT will convert a WITH RETURN cursor to an Amazon Redshift REFCURSOR.
For example, consider the following procedure, which contains a WITH RETURN cursor. The procedure opens the cursor and returns the result to the caller as a DYNAMIC RESULT SET:
AWS SCT converts the procedure as follows. An additional parameter is added to the procedure signature to pass the REFCURSOR:
IDENTITY columns
Teradata supports several non-ANSI compliant features for IDENTITY columns. We have enhanced AWS SCT to automatically convert these features to Amazon Redshift, whenever possible.
Specifically, AWS SCT now converts the Teradata START WITH and INCREMENT BY clauses to the Amazon Redshift SEED and STEP clauses, respectively. For example, consider the following Teradata table:
The GENERATED ALWAYS clause indicates that the column is always populated automatically—a value can’t be explicitly inserted or updated into the column. The START WITH clause defines the first value to be inserted into the column, and the INCREMENT BY clause defines the next value to insert into the column.
When you convert this table using AWS SCT, the following Amazon Redshift DDL is produced. Notice that the START WITH and INCREMENT BY values are preserved in the target syntax:
Also, by default, an IDENTITY column in Amazon Redshift only contains auto-generated values, so that the GENERATED ALWAYS property in Teradata is preserved:
IDENTITY columns in Teradata can also be specified as GENERATED BY DEFAULT. In this case, a value can be explicitly defined in an INSERT statement. If no value is specified, the column is filled with an auto-generated value like normal. Before, AWS SCT didn’t support conversion for GENERATED BY DEFAULT columns. Now, we’re happy to share that AWS SCT automatically converts such columns for you.
For example, the following table contains an IDENTITY column that is GENERATED BY DEFAULT:
The IDENTITY column is converted by AWS SCT as follows. The converted column uses the Amazon Redshift GENERATED BY DEFAULT clause:
There is one additional syntax issue that requires attention. In Teradata, an auto-generated value is inserted when NULL is specified for the column value:
Amazon Redshift uses a different syntax for the same purpose. Here, you include the keyword DEFAULT in the values list to indicate that the column should be auto-generated:
We’re happy to share that AWS SCT automatically converts the Teradata syntax for INSERT statements like the preceding example. For example, consider the following Teradata macro:
AWS SCT removes the NULL and replaces it with DEFAULT:
IDENTITY column automation is available now in AWS SCT. You can download the latest version and try it out.
ANY and SOME filters with inequality predicates
The ANY and SOME filters determine if a predicate applies to one or more values in a list. For example, in Teradata, you can use <> ANY to find all employees who don’t work for a certain manager:
Of course, you can rewrite this query using a simple not equal filter, but you often see queries from third-party SQL generators that follow this pattern.
Amazon Redshift doesn’t support this syntax natively. Before, any queries using this syntax had to be manually converted. Now, we’re happy to share that AWS SCT automatically converts ANY and SOME clauses with inequality predicates. The macro above is converted to a stored procedure as follows.
If the values list following the ANY contains two more values, AWS SCT will convert this to a series of OR conditions, one for each element in the list.
ANY/SOME filter conversion is available now in AWS SCT. You can try it out in the latest version of the application.
Analytic functions with RESET WHEN
RESET WHEN is a Teradata feature used in SQL analytical window functions. It’s an extension to the ANSI SQL standard. RESET WHEN determines the partition over which a SQL window function operates based on a specified condition. If the condition evaluates to true, a new dynamic sub-partition is created inside the existing window partition.
For example, the following view uses RESET WHEN to compute a running total by store. The running total accumulates as long as sales increase month over month. If sales drop from one month to the next, the running total resets.
To demonstrate, we insert some test data into the table:
The sales amounts drop after months 3 and 7. The running total is reset accordingly at months 4 and 8.
AWS SCT converts the view as follows. The converted code uses a subquery to emulate the RESET WHEN. Essentially, a marker attribute is added to the result that flags a month over month sales drop. The flag is then used to determine the longest preceding run of increasing sales to aggregate.
We expect that RESET WHEN conversion will be a big hit with customers. You can try it now in AWS SCT.
TD_NORMALIZE_OVERLAP() function
The TD_NORMALIZE_OVERLAP function combines rows that have overlapping PERIOD values. The resulting normalized row contains the earliest starting bound and the latest ending bound from the PERIOD values of all the rows involved.
For example, we create a Teradata table that records employee salaries with the following code. Each row in the table is timestamped with the period that the employee was paid the given salary.
Now we add data for two employees. For emp_id = 1
and salary = 2000
, there are two overlapping rows. Similarly, the two rows with emp_id = 2
and salary = 3000
are overlapping.
Now we create a view that uses the TD_NORMALIZE_OVERLAP function to normalize the overlapping data: