Accelerate your data warehouse migration to Amazon Redshift – Part 3

This is the third post in a multi-part series. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and reduce your overall cost to migrate to Amazon Redshift.

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other services such as Amazon EMR, Amazon Athena, and Amazon SageMaker to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating from self-managed data warehouse engines, like Teradata, to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of historical data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.

Until now, migrating a Teradata data warehouse to AWS was complex and involved a significant amount of manual effort.

Today, we’re happy to share recent enhancements to Amazon Redshift and the AWS Schema Conversion Tool (AWS SCT) that make it easier to automate your Teradata to Amazon Redshift migrations.

In this post, we introduce new automation for merge statements, a native function to support ASCII character conversion, enhanced error checking for string to date conversion, enhanced support for Teradata cursors and identity columns, automation for ANY and SOME predicates, automation for RESET WHEN clauses, automation for two proprietary Teradata functions (TD_NORMALIZE_OVERLAP and TD_UNPIVOT), and automation to support analytic functions (QUANTILE and QUALIFY).

Merge statement

As its name implies, the merge statement takes an input set and merges it into a target table. If an input row already exists in the target table (that is, a row in the target table has the same primary key value), the target row is updated. If there is no matching target row, the input row is inserted into the table.

Until now, if you used merge statements in your workload, you were forced to manually rewrite the merge statement to run on Amazon Redshift. Now, we’re happy to share that AWS SCT automates this conversion for you. AWS SCT decomposes a merge statement into an update on existing records followed by an insert for new records.

Let’s look at an example. We create two tables in Teradata: a target table, employee, and a delta table, employee_delta, where we stage the input rows:

CREATE TABLE testschema.employee(
  id INTEGER
, name VARCHAR(20)
, manager INTEGER)
UNIQUE PRIMARY INDEX (id)
;

CREATE TABLE testschema.employee_delta (
  id INTEGER
, name VARCHAR(20)
, manager INTEGER)
UNIQUE PRIMARY INDEX(id)
;

Now we create a Teradata merge statement that updates a row if it exists in the target; otherwise, it inserts the new row. We embed this merge statement in a macro so we can show you the conversion process later.

REPLACE MACRO testschema.merge_employees AS (
  MERGE INTO testschema.employee tgt
  USING testschema.employee_delta delta
    ON delta.id = tgt.id
  WHEN MATCHED THEN
    UPDATE SET name = delta.name, manager = delta.manager
  WHEN NOT MATCHED THEN
    INSERT (delta.id, delta.name, delta.manager);
);

Now we use AWS SCT to convert the macro. (See Accelerate your data warehouse migration to Amazon Redshift – Part 1 for details on macro conversion.) AWS SCT creates a stored procedure that contains an update (to implement the WHEN MATCHED condition) and an insert (to implement the WHEN NOT MATCHED condition).

CREATE OR REPLACE PROCEDURE testschema.merge_employees()
AS $BODY$
BEGIN
    UPDATE testschema.employee
    SET name = "delta".name, manager = "delta".manager
    FROM testschema.employee_delta AS delta JOIN testschema.employee AS tgt
        ON "delta".id = tgt.id;
      
    INSERT INTO testschema.employee
    SELECT
      "delta".id
    , "delta".name
    , "delta".manager
    FROM testschema.employee_delta AS delta
    WHERE NOT EXISTS (
      SELECT 1
      FROM testschema.employee AS tgt
      WHERE "delta".id = tgt.id
    );
END;
$BODY$
LANGUAGE plpgsql;
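
After conversion, you apply the staged changes on Amazon Redshift by calling the procedure:

CALL testschema.merge_employees();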

This example showed how to use merge automation for macros, but you can convert merge statements in any application context: stored procedures, BTEQ scripts, Java code, and more. Download the latest version of AWS SCT and try it out.

ASCII() function

The ASCII function takes a string as input and returns the ASCII code, or more precisely, the Unicode code point, of the first character in the string. Previously, Amazon Redshift supported ASCII as a leader-node only function, which prevented its use with user-defined tables.

We’re happy to share that the ASCII function is now available on Amazon Redshift compute nodes and can be used with user-defined tables. In the following code, we create a table with some string data:

CREATE TABLE testschema.char_table (
  id INTEGER
, char_col  CHAR(10)
, varchar_col VARCHAR(10)
);

INSERT INTO testschema.char_table VALUES (1, 'Hello', 'world');

Now you can use the ASCII function on the string columns:

# SELECT id, char_col, ascii(char_col), varchar_col, ascii(varchar_col) FROM testschema.char_table;

 id |  char_col  | ascii | varchar_col | ascii 
  1 | Hello      |    72 | world       |   119

Lastly, if your application code uses the ASCII function, AWS SCT automatically converts any such function calls to Amazon Redshift.

The ASCII feature is available now—try it out in your own cluster.

TO_DATE() function

The TO_DATE function converts a character string into a DATE value. A quirk of this function is that it can accept a string value that isn’t a valid date and translate it into a valid date.

For example, consider the string 2021-06-31. This isn’t a valid date because the month of June has only 30 days. However, the TO_DATE function accepts this string and returns the “31st” day of June (July 1):

# SELECT to_date('2021-06-31', 'YYYY-MM-DD');
 to_date 
 2021-07-01
(1 row)

Customers have asked for strict input checking for TO_DATE, and we’re happy to share this new capability. Now, you can include a Boolean value in the function call that turns on strict checking:

# SELECT to_date('2021-06-31', 'YYYY-MM-DD', TRUE);
ERROR: date/time field date value out of range: 2021-6-31

You can turn off strict checking explicitly as well:

# SELECT to_date('2021-06-31', 'YYYY-MM-DD', FALSE);
 to_date 
 2021-07-01
(1 row)

Also, the Boolean value is optional. If you don’t include it, strict checking is turned off, and you see the same behavior as before the feature was launched.

You can learn more about the TO_DATE function and try out strict date checking in Amazon Redshift now.

CURSOR result sets

A cursor is a programming language construct that applications use to manipulate a result set one row at a time. Cursors are more relevant for OLTP applications, but some legacy applications built on data warehouses also use them.

Teradata provides a diverse set of cursor configurations. Amazon Redshift supports a more streamlined set of cursor features.

Based on customer feedback, we’ve added automation to support Teradata WITH RETURN cursors. These types of cursors are opened within stored procedures and returned to the caller for processing of the result set. AWS SCT will convert a WITH RETURN cursor to an Amazon Redshift REFCURSOR.

For example, consider the following procedure, which contains a WITH RETURN cursor. The procedure opens the cursor and returns the result to the caller as a DYNAMIC RESULT SET:

REPLACE PROCEDURE testschema.employee_cursor (IN p_mgrid INTEGER) DYNAMIC RESULT SETS 1
BEGIN
   DECLARE result_set CURSOR WITH RETURN ONLY FOR 
     SELECT id, name, manager 
     FROM testschema.employee
     WHERE manager = to_char(p_mgrid); 
   OPEN result_set;
END;

AWS SCT converts the procedure as follows. An additional parameter is added to the procedure signature to pass the REFCURSOR:

CREATE OR REPLACE PROCEDURE testschema.employee_cursor(par_p_mgrid IN INTEGER, dynamic_return_cursor INOUT refcursor)
AS $BODY$
DECLARE
BEGIN
    OPEN dynamic_return_cursor FOR
    SELECT
        id, name, manager
        FROM testschema.employee
        WHERE manager = to_char(par_p_mgrid, '99999');
END;
$BODY$
LANGUAGE plpgsql;
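
To consume the result set, the caller passes a cursor name to the procedure and fetches from it inside a transaction. The following is a minimal sketch of that pattern on Amazon Redshift (the manager ID 101 and the cursor name employee_rs are example values):

BEGIN;
CALL testschema.employee_cursor(101, 'employee_rs');
FETCH ALL FROM employee_rs;
COMMIT;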

IDENTITY columns

Teradata supports several non-ANSI compliant features for IDENTITY columns. We have enhanced AWS SCT to automatically convert these features to Amazon Redshift, whenever possible.

Specifically, AWS SCT now converts the Teradata START WITH and INCREMENT BY clauses to the Amazon Redshift SEED and STEP clauses, respectively. For example, consider the following Teradata table:

CREATE TABLE testschema.identity_table (
  a2 BIGINT GENERATED ALWAYS AS IDENTITY (
    START WITH 1 
    INCREMENT BY 20
  )
);

The GENERATED ALWAYS clause indicates that the column is always populated automatically—a value can’t be explicitly inserted or updated into the column. The START WITH clause defines the first value to be inserted into the column, and the INCREMENT BY clause defines the next value to insert into the column.

When you convert this table using AWS SCT, the following Amazon Redshift DDL is produced. Notice that the START WITH and INCREMENT BY values are preserved in the target syntax:

CREATE TABLE IF NOT EXISTS testschema.identity_table (
  a2 BIGINT IDENTITY(1, 20)
)
DISTSTYLE KEY
DISTKEY
(a2)
SORTKEY
(a2);

Also, by default, an IDENTITY column in Amazon Redshift contains only auto-generated values, so the GENERATED ALWAYS property in Teradata is preserved:

# INSERT INTO testschema.identity_table VALUES (100);
ERROR:  cannot set an identity column to a value

IDENTITY columns in Teradata can also be specified as GENERATED BY DEFAULT. In this case, a value can be explicitly defined in an INSERT statement. If no value is specified, the column is filled with an auto-generated value like normal. Before, AWS SCT didn’t support conversion for GENERATED BY DEFAULT columns. Now, we’re happy to share that AWS SCT automatically converts such columns for you.

For example, the following table contains an IDENTITY column that is GENERATED BY DEFAULT:

CREATE TABLE testschema.identity_by_default (
  a1 BIGINT GENERATED BY DEFAULT AS IDENTITY (
     START WITH 1 
     INCREMENT BY 20 
  )
)
PRIMARY INDEX (a1);

The IDENTITY column is converted by AWS SCT as follows. The converted column uses the Amazon Redshift GENERATED BY DEFAULT clause:

CREATE TABLE testschema.identity_by_default (
  a1 BIGINT GENERATED BY DEFAULT AS IDENTITY(1,20) DISTKEY
)                                                          
 DISTSTYLE KEY                                               
 SORTKEY (a1);
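
Unlike the GENERATED ALWAYS column shown earlier, a GENERATED BY DEFAULT identity column accepts an explicitly supplied value, so an insert like the following succeeds on the converted table:

INSERT INTO testschema.identity_by_default VALUES (100);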

There is one additional syntax issue that requires attention. In Teradata, an auto-generated value is inserted when NULL is specified for the column value:

INSERT INTO identity_by_default VALUES (null);

Amazon Redshift uses a different syntax for the same purpose. Here, you include the keyword DEFAULT in the values list to indicate that the column should be auto-generated:

INSERT INTO testschema.identity_by_default VALUES (default);

We’re happy to share that AWS SCT automatically converts the Teradata syntax for INSERT statements like the preceding example. For example, consider the following Teradata macro:

REPLACE MACRO testschema.insert_identity_by_default AS (
  INSERT INTO testschema.identity_by_default VALUES (NULL);
);

AWS SCT removes the NULL and replaces it with DEFAULT:

CREATE OR REPLACE PROCEDURE testschema.insert_identity_by_default() LANGUAGE plpgsql
AS $$                                                              
BEGIN                                                              
  INSERT INTO testschema.identity_by_default VALUES (DEFAULT);
END;                                                               
$$;

IDENTITY column automation is available now in AWS SCT. You can download the latest version and try it out.

ANY and SOME filters with inequality predicates

The ANY and SOME filters determine if a predicate applies to one or more values in a list. For example, in Teradata, you can use <> ANY to find all employees who don’t work for a certain manager:

REPLACE MACRO testschema.not_in_103 AS (
  SELECT *
  FROM testschema.employee 
  WHERE manager <> ANY (103);
);

Of course, you can rewrite this query using a simple not equal filter, but you often see queries from third-party SQL generators that follow this pattern.

Amazon Redshift doesn’t support this syntax natively. Before, any queries using this syntax had to be manually converted. Now, we’re happy to share that AWS SCT automatically converts ANY and SOME clauses with inequality predicates. The macro above is converted to a stored procedure as follows.

CREATE OR REPLACE PROCEDURE testschema.not_in_103(macro_out INOUT refcursor)
AS $BODY$
BEGIN
    OPEN macro_out FOR
    SELECT *
    FROM testschema.employee
    WHERE ((manager <> 103));
END;
$BODY$
LANGUAGE plpgsql;

If the values list following the ANY contains two or more values, AWS SCT converts this to a series of OR conditions, one for each element in the list.
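
For example, a filter such as manager <> ANY (103, 104) would be converted along these lines (a sketch of the resulting predicate, not verbatim AWS SCT output):

SELECT *
FROM testschema.employee
WHERE ((manager <> 103) OR (manager <> 104));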

ANY/SOME filter conversion is available now in AWS SCT. You can try it out in the latest version of the application.

Analytic functions with RESET WHEN

RESET WHEN is a Teradata feature used in SQL analytical window functions. It’s an extension to the ANSI SQL standard. RESET WHEN determines the partition over which a SQL window function operates based on a specified condition. If the condition evaluates to true, a new dynamic sub-partition is created inside the existing window partition.

For example, the following view uses RESET WHEN to compute a running total by store. The running total accumulates as long as sales increase month over month. If sales drop from one month to the next, the running total resets.

CREATE TABLE testschema.sales (
  store_id INTEGER
, month_no INTEGER
, sales_amount DECIMAL(9,2)
)
;

REPLACE VIEW testschema.running_total (
  store_id
, month_no
, sales_amount
, cume_sales_amount
)
AS
SELECT 
  store_id
, month_no
, sales_amount
, SUM(sales_amount) OVER (
     PARTITION BY store_id 
     ORDER BY month_no
     RESET WHEN sales_amount < SUM(sales_amount) OVER (
       PARTITION BY store_id
       ORDER BY month_no
       ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
     )
     ROWS UNBOUNDED PRECEDING 
  )
FROM testschema.sales;

To demonstrate, we insert some test data into the table:

INSERT INTO testschema.sales VALUES (1001, 1, 35000.00);
INSERT INTO testschema.sales VALUES (1001, 2, 40000.00);
INSERT INTO testschema.sales VALUES (1001, 3, 45000.00);
INSERT INTO testschema.sales VALUES (1001, 4, 25000.00);
INSERT INTO testschema.sales VALUES (1001, 5, 30000.00);
INSERT INTO testschema.sales VALUES (1001, 6, 30000.00);
INSERT INTO testschema.sales VALUES (1001, 7, 50000.00);
INSERT INTO testschema.sales VALUES (1001, 8, 35000.00);
INSERT INTO testschema.sales VALUES (1001, 9, 60000.00);
INSERT INTO testschema.sales VALUES (1001, 10, 80000.00);
INSERT INTO testschema.sales VALUES (1001, 11, 90000.00);
INSERT INTO testschema.sales VALUES (1001, 12, 100000.00);

The sales amounts drop after months 3 and 7. The running total is reset accordingly at months 4 and 8.

SELECT * FROM testschema.running_total;

   store_id     month_no  sales_amount  cume_sales_amount
-----------  -----------  ------------  -----------------
       1001            1      35000.00           35000.00
       1001            2      40000.00           75000.00
       1001            3      45000.00          120000.00
       1001            4      25000.00           25000.00
       1001            5      30000.00           55000.00
       1001            6      30000.00           85000.00
       1001            7      50000.00          135000.00
       1001            8      35000.00           35000.00
       1001            9      60000.00           95000.00
       1001           10      80000.00          175000.00
       1001           11      90000.00          265000.00
       1001           12     100000.00          365000.00

AWS SCT converts the view as follows. The converted code uses nested subqueries to emulate RESET WHEN. Essentially, a marker attribute is added to the result that flags a month-over-month sales drop. A running sum of the flags then assigns a group number to each run of non-decreasing sales, and the running total is computed within each group.

CREATE OR REPLACE VIEW testschema.running_total (
  store_id
, month_no
, sales_amount
, cume_sales_amount) AS
SELECT
  store_id
, month_no
, sales_amount
, sum(sales_amount) OVER
    (PARTITION BY k1, store_id ORDER BY month_no NULLS FIRST
     ROWS UNBOUNDED PRECEDING)
FROM (
  SELECT
   store_id
 , month_no
 , sales_amount
 , SUM(CASE WHEN k = 1 THEN 0 ELSE 1 END) OVER
     (PARTITION BY store_id ORDER BY month_no NULLS FIRST
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS k1
 FROM (
   SELECT
     store_id
   , month_no
   , sales_amount
   , CASE WHEN sales_amount < SUM(sales_amount) OVER 
      (PARTITION BY store_id ORDER BY month_no 
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) 
      OR sales_amount IS NULL THEN 0 ELSE 1 END AS k
   FROM testschema.sales
  )
);

We expect that RESET WHEN conversion will be a big hit with customers. You can try it now in AWS SCT.

TD_NORMALIZE_OVERLAP() function

The TD_NORMALIZE_OVERLAP function combines rows that have overlapping PERIOD values. The resulting normalized row contains the earliest starting bound and the latest ending bound from the PERIOD values of all the rows involved.

For example, we create a Teradata table that records employee salaries with the following code. Each row in the table is timestamped with the period that the employee was paid the given salary.

CREATE TABLE testschema.salaries (
  emp_id INTEGER
, salary DECIMAL(8,2)
, from_to PERIOD(DATE)
);

Now we add data for two employees. For emp_id = 1 and salary = 2000, there are two overlapping rows. Similarly, the two rows with emp_id = 2 and salary = 3000 are overlapping.
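
The following INSERT statements are a sketch of one way to load this data in Teradata, using the PERIOD constructor to populate the from_to column:

INSERT INTO testschema.salaries VALUES (1, 1000.00, PERIOD(DATE '2020-01-01', DATE '2020-05-31'));
INSERT INTO testschema.salaries VALUES (1, 2000.00, PERIOD(DATE '2020-06-01', DATE '2021-02-28'));
INSERT INTO testschema.salaries VALUES (1, 2000.00, PERIOD(DATE '2021-01-01', DATE '2021-06-30'));
INSERT INTO testschema.salaries VALUES (2, 3000.00, PERIOD(DATE '2020-01-01', DATE '2020-03-31'));
INSERT INTO testschema.salaries VALUES (2, 3000.00, PERIOD(DATE '2020-02-01', DATE '2020-04-30'));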

SELECT * FROM testschema.salaries ORDER BY emp_id, from_to;

     emp_id      salary  from_to
-----------  ----------  ------------------------
          1     1000.00  ('20/01/01', '20/05/31')
          1     2000.00  ('20/06/01', '21/02/28')
          1     2000.00  ('21/01/01', '21/06/30')
          2     3000.00  ('20/01/01', '20/03/31')
          2     3000.00  ('20/02/01', '20/04/30')

Now we create a view that uses the TD_NORMALIZE_OVERLAP function to normalize the overlapping data:

REPLACE VIEW testschema.normalize_salaries AS 
WITH sub_table(emp_id, salary, from_to) AS (
  SELECT 
    emp_id
  , salary
  , from_to
  FROM testschema.salaries
)
SELECT *
FROM 
  TABLE(TD_SYSFNLIB.TD_NORMALIZE_OVERLAP (NEW VARIANT_TYPE(sub_table.emp_id, sub_table.salary), sub_table.from_to)
    RETURNS (emp_id INTEGER, salary DECIMAL(8,2), from_to PERIOD(DATE))
    HASH BY emp_id
    LOCAL ORDER BY emp_id, salary, from_to
  ) AS DT(emp_id, salary, duration)
;

We can check that the view data is actually normalized:

select * from testschema.normalize_salaries order by emp_id, duration;

     emp_id      salary  duration
-----------  ----------  ------------------------
          1     1000.00  ('20/01/01', '20/05/31')
          1     2000.00  ('20/06/01', '21/06/30')
          2     3000.00  ('20/01/01', '20/04/30')

You can now use AWS SCT to convert any TD_NORMALIZE_OVERLAP statements. We first convert the salaries table to Amazon Redshift (see Accelerate your data warehouse migration to Amazon Redshift – Part 2 for details about period data type automation):

CREATE TABLE testschema.salaries (
  emp_id integer distkey
, salary numeric(8,2) ENCODE az64
, from_to_begin date ENCODE az64
, from_to_end date ENCODE az64    
)                                   
DISTSTYLE KEY                       
SORTKEY (emp_id);
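
To follow along on Amazon Redshift, you can load the same rows into the converted table; the period bounds simply become the two date columns:

INSERT INTO testschema.salaries VALUES (1, 1000.00, '2020-01-01', '2020-05-31');
INSERT INTO testschema.salaries VALUES (1, 2000.00, '2020-06-01', '2021-02-28');
INSERT INTO testschema.salaries VALUES (1, 2000.00, '2021-01-01', '2021-06-30');
INSERT INTO testschema.salaries VALUES (2, 3000.00, '2020-01-01', '2020-03-31');
INSERT INTO testschema.salaries VALUES (2, 3000.00, '2020-02-01', '2020-04-30');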

# SELECT * FROM testschema.salaries ORDER BY emp_id, from_to_begin;
 emp_id | salary  | from_to_begin | from_to_end 
      1 | 1000.00 | 2020-01-01    | 2020-05-31
      1 | 2000.00 | 2020-06-01    | 2021-02-28
      1 | 2000.00 | 2021-01-01    | 2021-06-30
      2 | 3000.00 | 2020-01-01    | 2020-03-31
      2 | 3000.00 | 2020-02-01    | 2020-04-30

Now we use AWS SCT to convert the normalize_salaries view. AWS SCT adds a column that marks the start of a new group of rows. It then produces a single row for each group with a normalized timestamp.

CREATE VIEW testschema.normalize_salaries (emp_id, salary, from_to_begin, from_to_end) AS
WITH sub_table AS (
  SELECT
    emp_id
  , salary
  , from_to_begin AS start_date
  , from_to_end AS end_date
  , CASE
      WHEN start_date <= lag(end_date) OVER (PARTITION BY emp_id, salary ORDER BY start_date, end_date) THEN 0 
      ELSE 1
    END AS GroupStartFlag
    FROM testschema.salaries
  )
SELECT
  t2.emp_id
, t2.salary
, min(t2.start_date) AS from_to_begin
, max(t2.end_date) AS from_to_end
FROM (
  SELECT
    emp_id
  , salary
  , start_date
  , end_date
  , sum(GroupStartFlag) OVER (PARTITION BY emp_id, salary ORDER BY start_date ROWS UNBOUNDED PRECEDING) AS GroupID
  FROM 
    sub_table
) AS t2
GROUP BY 
  t2.emp_id
, t2.salary
, t2.GroupID;

We can check that the converted view returns the correctly normalized data:

# SELECT * FROM testschema.normalize_salaries ORDER BY emp_id;
 emp_id | salary  | from_to_begin | from_to_end 
      1 | 1000.00 | 2020-01-01    | 2020-05-31
      1 | 2000.00 | 2020-06-01    | 2021-06-30
      2 | 3000.00 | 2020-01-01    | 2020-04-30

You can try out TD_NORMALIZE_OVERLAP conversion in the latest release of AWS SCT. Download it now.

TD_UNPIVOT() function

The TD_UNPIVOT function transforms columns into rows. Essentially, we use it to take a row of similar metrics over different time periods and create a separate row for each metric.

For example, consider the following Teradata table. The table records customer visits by year and month for small kiosk stores:

CREATE TABLE TESTSCHEMA.kiosk_monthly_visits (
  kiosk_id INTEGER
, year_no INTEGER
, jan_visits INTEGER
, feb_visits INTEGER
, mar_visits INTEGER
, apr_visits INTEGER
, may_visits INTEGER
, jun_visits INTEGER
, jul_visits INTEGER
, aug_visits INTEGER
, sep_visits INTEGER
, oct_visits INTEGER
, nov_visits INTEGER
, dec_visits INTEGER)
PRIMARY INDEX (kiosk_id);

We insert some sample data into the table:

INSERT INTO testschema.kiosk_monthly_visits VALUES (100, 2020, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200);

Next, we create a view that unpivots the table so that the monthly visits appear on separate rows. The single row in the pivoted table creates 12 rows in the unpivoted table, one row per month.

REPLACE VIEW testschema.unpivot_kiosk_monthly_visits (
  kiosk_id
, year_no
, month_name
, month_visits
)
AS
SELECT 
  kiosk_id
, year_no
, month_name (FORMAT 'X(10)')
, month_visits
FROM TD_UNPIVOT (
 ON (SELECT * FROM testschema.kiosk_monthly_visits)
 USING
 VALUE_COLUMNS ('month_visits')
 UNPIVOT_COLUMN('month_name')
 COLUMN_LIST(
   'jan_visits'
 , 'feb_visits'
 , 'mar_visits'
 , 'apr_visits'
 , 'may_visits'
 , 'jun_visits'
 , 'jul_visits'
 , 'aug_visits'
 , 'sep_visits'
 , 'oct_visits'
 , 'nov_visits'
 , 'dec_visits'
 )
 COLUMN_ALIAS_LIST (
   'jan'
 , 'feb'
 , 'mar'
 , 'apr'
 , 'may'
 , 'jun'
 , 'jul'
 , 'aug'
 , 'sep'
 , 'oct'
 , 'nov'
 , 'dec'
 )
) a;

When you select from the view, the monthly visits are unpivoted into 12 separate rows:

SELECT * FROM testschema.unpivot_kiosk_monthly_visits;

   kiosk_id      year_no  month_name  month_visits
-----------  -----------  ----------  ------------
        100         2020  jan                 1100
        100         2020  feb                 1200
        100         2020  mar                 1300
        100         2020  apr                 1400
        100         2020  may                 1500
        100         2020  jun                 1600
        100         2020  jul                 1700
        100         2020  aug                 1800
        100         2020  sep                 1900
        100         2020  oct                 2000
        100         2020  nov                 2100
        100         2020  dec                 2200

Now we use AWS SCT to convert the view into ANSI SQL that can run on Amazon Redshift. The conversion creates a common table expression (CTE) with one row per month name. It then cross-joins the CTE with the original pivoted table and uses a CASE expression to select the matching month's visit count for each output row.

CREATE OR REPLACE VIEW testschema.unpivot_kiosk_monthly_visits (kiosk_id, year_no, month_name, month_visits) AS
WITH cols
AS (SELECT
    'jan' AS col
UNION ALL
SELECT
    'feb' AS col
UNION ALL
SELECT
    'mar' AS col
UNION ALL
SELECT
    'apr' AS col
UNION ALL
SELECT
    'may' AS col
UNION ALL
SELECT
    'jun' AS col
UNION ALL
SELECT
    'jul' AS col
UNION ALL
SELECT
    'aug' AS col
UNION ALL
SELECT
    'sep' AS col
UNION ALL
SELECT
    'oct' AS col
UNION ALL
SELECT
    'nov' AS col
UNION ALL
SELECT
    'dec' AS col)
SELECT
    t1.kiosk_id, t1.year_no, col AS "month_name",
    CASE col
        WHEN 'jan' THEN "jan_visits"
        WHEN 'feb' THEN "feb_visits"
        WHEN 'mar' THEN "mar_visits"
        WHEN 'apr' THEN "apr_visits"
        WHEN 'may' THEN "may_visits"
        WHEN 'jun' THEN "jun_visits"
        WHEN 'jul' THEN "jul_visits"
        WHEN 'aug' THEN "aug_visits"
        WHEN 'sep' THEN "sep_visits"
        WHEN 'oct' THEN "oct_visits"
        WHEN 'nov' THEN "nov_visits"
        WHEN 'dec' THEN "dec_visits"
        ELSE NULL
    END AS "month_visits"
    FROM testschema.kiosk_monthly_visits AS t1
    CROSS JOIN cols
    WHERE month_visits IS NOT NULL;

You can check that the converted view produces the same result as the Teradata version:

# SELECT * FROM testschema.unpivot_kiosk_monthly_visits;
 kiosk_id | year_no | month_name | month_visits 
      100 |    2020 | oct        |        2000
      100 |    2020 | nov        |        2100
      100 |    2020 | jul        |        1700
      100 |    2020 | feb        |        1200
      100 |    2020 | apr        |        1400
      100 |    2020 | aug        |        1800
      100 |    2020 | sep        |        1900
      100 |    2020 | jan        |        1100
      100 |    2020 | mar        |        1300
      100 |    2020 | may        |        1500
      100 |    2020 | jun        |        1600
      100 |    2020 | dec        |        2200

You can try out the conversion support for TD_UNPIVOT in the latest version of AWS SCT.

QUANTILE function

QUANTILE is a ranking function. It partitions the input set into a specified number of groups, each containing an equal portion of the total population. QUANTILE is a proprietary Teradata extension of the NTILE function found in ANSI SQL.

For example, we can compute the quartiles of the monthly visit data using the following Teradata view:

REPLACE VIEW testschema.monthly_visit_rank AS
SELECT
  kiosk_id
, year_no
, month_name
, month_visits
, QUANTILE(4, month_visits) qtile
FROM
 testschema.unpivot_kiosk_monthly_visits
;

When you select from the view, the QUANTILE function computes the quartile and applies it as an attribute on the output:

SELECT * FROM monthly_visit_rank;

   kiosk_id      year_no  month_name  month_visits        qtile
-----------  -----------  ----------  ------------  -----------
        100         2020  jan                 1100            0
        100         2020  feb                 1200            0
        100         2020  mar                 1300            0
        100         2020  apr                 1400            1
        100         2020  may                 1500            1
        100         2020  jun                 1600            1
        100         2020  jul                 1700            2
        100         2020  aug                 1800            2
        100         2020  sep                 1900            2
        100         2020  oct                 2000            3
        100         2020  nov                 2100            3
        100         2020  dec                 2200            3

Amazon Redshift supports the generalized NTILE function, which is ANSI-compliant and can implement QUANTILE. We've enhanced AWS SCT to automatically convert QUANTILE function calls to equivalent NTILE function calls. Because QUANTILE numbers its groups starting from 0 and NTILE starts from 1, the converted code subtracts 1 from the NTILE result.

For example, when you convert the preceding Teradata view, AWS SCT produces the following Amazon Redshift code:

SELECT 
  unpivot_kiosk_monthly_visits.kiosk_id
, unpivot_kiosk_monthly_visits.year_no
, unpivot_kiosk_monthly_visits.month_name
, unpivot_kiosk_monthly_visits.month_visits
, (ntile(4) OVER (ORDER BY unpivot_kiosk_monthly_visits.month_visits ASC NULLS FIRST) - 1) AS qtile
FROM 
  testschema.unpivot_kiosk_monthly_visits
;

QUANTILE conversion support is available now in AWS SCT.

QUALIFY filter

The QUALIFY clause in Teradata filters rows produced by an analytic function. Let’s look at an example. We use the following table, which contains store revenue by month. Our goal is to find the top five months by revenue:

CREATE TABLE testschema.sales (
  store_id INTEGER
, month_no INTEGER
, sales_amount DECIMAL(9,2))
PRIMARY INDEX (store_id);


SELECT * FROM sales;

   store_id     month_no  sales_amount
-----------  -----------  ------------
       1001            1      35000.00
       1001            2      40000.00
       1001            3      45000.00
       1001            4      25000.00
       1001            5      30000.00
       1001            6      30000.00
       1001            7      50000.00
       1001            8      35000.00
       1001            9      60000.00
       1001           10      80000.00
       1001           11      90000.00
       1001           12     100000.00

The data shows that July, September, October, November, and December were the top five sales months.

We create a view that uses the RANK function to rank each month by sales, then use the QUALIFY clause to select the top five months:

REPLACE VIEW testschema.top_five_months(
  store_id
, month_no
, sales_amount
, month_rank
) as
SELECT
  store_id
, month_no
, sales_amount
, RANK() OVER (PARTITION BY store_id ORDER BY sales_amount DESC) month_rank
FROM
  testschema.sales
QUALIFY RANK() OVER (PARTITION by store_id ORDER BY sales_amount DESC) <= 5
;

Before, if you used the QUALIFY clause, you had to manually recode your SQL statements. Now, AWS SCT automatically converts QUALIFY into Amazon Redshift-compatible, ANSI-compliant SQL. For example, AWS SCT rewrites the preceding view as follows:

CREATE OR REPLACE VIEW testschema.top_five_months (
  store_id
, month_no
, sales_amount
, month_rank) AS
SELECT
  qualify_subquery.store_id
, qualify_subquery.month_no
, qualify_subquery.sales_amount
, month_rank
FROM (
  SELECT
    store_id
  , month_no
  , sales_amount
  , rank() OVER (PARTITION BY store_id ORDER BY sales_amount DESC NULLS FIRST) AS month_rank
  , rank() OVER (PARTITION BY store_id ORDER BY sales_amount DESC NULLS FIRST) AS qualify_expression_1
  FROM testschema.sales) AS qualify_subquery
  WHERE 
    qualify_expression_1 <= 5;

AWS SCT converts the original query into a subquery and applies the QUALIFY expression as a filter on that subquery. AWS SCT also adds an extra ranking column to the subquery for filtering purposes. This column isn't strictly necessary, but it simplifies the conversion when column aliases aren't used.
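
For comparison, here is a sketch of the simpler form that filters on the month_rank alias directly; it returns the same result because the ranking expression is already projected by the subquery:

SELECT store_id, month_no, sales_amount, month_rank
FROM (
  SELECT
    store_id
  , month_no
  , sales_amount
  , RANK() OVER (PARTITION BY store_id ORDER BY sales_amount DESC NULLS FIRST) AS month_rank
  FROM testschema.sales
) AS qualify_subquery
WHERE month_rank <= 5;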

You can try QUALIFY conversion in the latest version of AWS SCT.

Summary

We’re happy to share these new features with you. If you’re contemplating a migration to Amazon Redshift, these capabilities can help automate your schema conversion and preserve your investment in existing reports and applications. If you’re looking to get started on a data warehouse migration, you can learn more about Amazon Redshift and AWS SCT from our public documentation.

This post described a few of the dozens of new features we’re introducing to automate your Teradata migrations to Amazon Redshift. We’ll share more in upcoming posts about automation for proprietary Teradata features and other exciting new capabilities.

Check back soon for more information. Until then, you can learn more about Amazon Redshift and the AWS Schema Conversion Tool. Happy migrating!


About the Authors

Michael Soo is a Senior Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Raza Hafeez is a Data Architect within the Lake House Global Specialty Practice of AWS Professional Services. He has over 10 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Lake House Architecture.

Po Hong, PhD, is a Principal Data Architect in the Lake House Global Specialty Practice of AWS Professional Services. He is passionate about helping customers adopt innovative solutions to reduce time to insight. Po specializes in migrating large-scale MPP on-premises data warehouses to the AWS Lake House architecture.

Entong Shen is a Software Development Manager for Amazon Redshift. He has been working on MPP databases for over 9 years and has focused on query optimization, statistics, and migration-related SQL language features such as stored procedures and data types.

Sumit Singh is a database engineer with the Database Migration Service team at Amazon Web Services. He works closely with customers and provides technical assistance to migrate their on-premises workloads to the AWS Cloud. He also helps continuously improve the quality and functionality of AWS data migration products.

Nelly Susanto is a Senior Database Migration Specialist with the AWS Database Migration Accelerator. She has over 10 years of technical experience focused on migrating and replicating databases and data warehouse workloads. She is passionate about helping customers in their cloud journey.