AWS Big Data Blog
Code conversion from Greenplum to Amazon Redshift: Handling arrays, dates, and regular expressions
Amazon Redshift is a fully managed data warehouse service that supports data lake and data analytics workloads for organizations of all sizes, from startups to large enterprises. Tens of thousands of customers around the globe use Amazon Redshift to modernize their data analytics platforms.
Greenplum is an open-source, massively parallel database used for analytics, mostly deployed on on-premises infrastructure. Greenplum is based on the PostgreSQL database engine.
Many customers have found migration to Amazon Redshift from Greenplum an attractive option instead of managing on-premises Greenplum for the following reasons:
- The opportunity to modernize the data lake and data warehouse environment
- Benefits of other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon CloudWatch, Amazon EMR, Amazon SageMaker, and more
Even though both Greenplum and Amazon Redshift are based on the open-source PostgreSQL database engine, migration still requires a lot of planning and manual intervention. This post covers the key functions and considerations when performing code conversion from Greenplum to Amazon Redshift, focusing on the migration of procedures, functions, and views.
Solution overview
AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT) can migrate most of the objects in a heterogeneous database migration from Greenplum to Amazon Redshift. But there are some situations where code conversion teams encounter errors and warnings for views, procedures, and functions while creating them in Amazon Redshift. To address this type of situation, manual conversion of the code is required.
This post focuses on how to handle the following while migrating from Greenplum to Amazon Redshift:
- Arrays
- Dates and timestamps
- Regular expressions (regex)
Note that for this post, we use Greenplum 4.3 and Amazon Redshift, which is based on PostgreSQL 8.0.2.
Working with array functions
The AWS SCT doesn’t convert array functions when migrating from Greenplum or PostgreSQL to Amazon Redshift, so developers need to convert those functions manually. This post outlines the most common array functions:
- ARRAY_UPPER
- JSON_EXTRACT_ARRAY_ELEMENT_TEXT and JSON_ARRAY_LENGTH
- UNNEST()
- STRING_AGG()
- ANY ARRAY()
ARRAY_UPPER()
This function returns the upper bound of an array. It can be used to extract the nth element from an array in PostgreSQL or Greenplum.
The Greenplum code is as follows:
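The following is a minimal sketch; the array literal is illustrative rather than the original post's example:

```sql
-- Return the upper bound of the array's first dimension,
-- then use it to fetch the last element
SELECT array_upper(ARRAY[1, 5, 9], 1);                      -- returns 3
SELECT (ARRAY[1, 5, 9])[array_upper(ARRAY[1, 5, 9], 1)];    -- returns 9
```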
There is no function to extract an element from an array in Amazon Redshift; however, there are two JSON functions that can be used for this purpose:
- JSON_EXTRACT_ARRAY_ELEMENT_TEXT() – Returns a JSON array element in the outermost array of a JSON string
- JSON_ARRAY_LENGTH() – Returns the number of elements in the outer array of a JSON string
See the following code:
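The following sketch assumes the array is stored as a JSON-formatted string; the values are illustrative:

```sql
-- Number of elements in the outermost JSON array
SELECT json_array_length('[1,5,9]');                        -- returns 3

-- Extract the last element (positions are zero-based)
SELECT json_extract_array_element_text('[1,5,9]',
       json_array_length('[1,5,9]') - 1);                   -- returns 9
```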
UNNEST()
UNNEST() is PostgreSQL’s system function for semi-structured data; it expands an array, or a combination of arrays, into a set of rows. It helps improve database performance when inserting, updating, or deleting thousands of records.
You can use UNNEST() with a basic array, multiple arrays, and multiple arrays of different lengths.
Some of the Amazon Redshift functions used to unnest arrays are split_part, json_extract_path_text, json_array_length, and json_extract_array_element_text.
In Greenplum, the UNNEST function is used to expand an array to a set of rows:
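For example, the following sketch (with assumed literal values) produces the output shown below:

```sql
SELECT 'A' AS id, unnest(ARRAY[1, 2]) AS value;
```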
The output is as follows:
A 1
A 2
Amazon Redshift doesn’t support the UNNEST function; you can use the following workaround:
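One possible workaround is sketched here, assuming the array is stored as a JSON-formatted string and that a small helper table named numbers holds the integers 1, 2, 3, and so on (both the helper table and the literals are assumptions, not the original post's code):

```sql
SELECT t.id,
       json_extract_array_element_text(t.items, n.n - 1) AS value
FROM   (SELECT 'A' AS id, '[1,2]' AS items) t
JOIN   numbers n
  ON   n.n <= json_array_length(t.items);
```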
STRING_AGG()
The STRING_AGG() function is an aggregate function that concatenates a list of strings and places a separator between them. The function doesn’t add the separator at the end of the string. The Greenplum code is as follows:
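The following is a hypothetical example; the employee table and its columns are assumptions:

```sql
-- Concatenate employee names per department, separated by commas
SELECT dept_id,
       string_agg(emp_name, ',') AS emp_list
FROM   employee
GROUP  BY dept_id;
```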
The Amazon Redshift equivalent of the STRING_AGG() function is LISTAGG(). This aggregate function orders the rows in each group according to the ORDER BY expression, then concatenates the values into a single string. See the following code:
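A sketch of the same aggregation using the assumed employee table:

```sql
-- Same result using LISTAGG, ordering names within each group
SELECT dept_id,
       listagg(emp_name, ',') WITHIN GROUP (ORDER BY emp_name) AS emp_list
FROM   employee
GROUP  BY dept_id;
```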
ANY ARRAY()
The PostgreSQL ANY ARRAY() construct compares the left-hand expression to each element of an array:
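A hypothetical example using the assumed employee table:

```sql
SELECT *
FROM   employee
WHERE  dept_id = ANY (ARRAY[10, 20, 30]);
```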
In Amazon Redshift, the evaluation can be achieved with an IN operator:
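The same filter sketched with IN:

```sql
SELECT *
FROM   employee
WHERE  dept_id IN (10, 20, 30);
```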
Working with date functions
In this section, we discuss calculating date differences using date_part in Greenplum and datediff in Amazon Redshift.
When the application needs to calculate the difference between the subfields of dates, Greenplum uses the function date_part, which allows you to retrieve subfields such as year, month, week, and day. In the following example query, we calculate the number of completion_days as the difference between originated_date and eco_date.
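A minimal sketch of this calculation; the loan_application table is an assumption used only for illustration:

```sql
-- Whole days elapsed between originated_date and eco_date
SELECT date_part('day', eco_date::timestamp - originated_date::timestamp) AS completion_days
FROM   loan_application;
```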
To calculate the difference between the subfields of dates, Amazon Redshift has the function datediff. DATEDIFF determines the number of date part boundaries that are crossed between the two expressions. The following query shows an example of calculating completion_days as the difference between eco_date and originated_date.
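A sketch of the equivalent query against the same assumed table:

```sql
-- Number of day boundaries crossed between originated_date and eco_date
SELECT datediff(day, originated_date, eco_date) AS completion_days
FROM   loan_application;
```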
We compare the Greenplum and Amazon Redshift queries as follows; a sketch of the general query pattern for both engines appears after the list:
- Difference by year
The following Greenplum query returns 1 year between 2009-01-01 and 2009-12-31:
The following Amazon Redshift query returns 1 year between 2009-01-01 and 2009-12-31:
- Difference by month
The following Greenplum query returns 1 month between 2009-01-01 and 2008-12-31:
The following Amazon Redshift query returns 1 month between 2009-01-01 and 2008-12-31:
- Difference by week
The following Greenplum query returns 0 weeks between 2009-01-01 and 2009-12-31:
The following Amazon Redshift query returns 0 weeks between 2009-01-01 and 2009-12-31:
- Difference by day
The following Greenplum query returns 1 day:
The following Amazon Redshift query returns 1 day:
- Difference by hour
The following Greenplum query returns 1 hour:
The following Amazon Redshift query returns 1 hour:
- Difference by minute
The following Greenplum query returns 3 minutes:
The following Amazon Redshift query returns 1 minute:
- Difference by second
The following Greenplum query returns 40 seconds:
The following Amazon Redshift query returns 45 seconds:
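As a minimal sketch of the general pattern, assuming hypothetical date literals (which differ from the specific values referenced above), the day and hour differences can be written as follows in each engine:

```sql
-- Greenplum: subtract the timestamps and extract the subfield from the interval
SELECT date_part('day',  timestamp '2009-01-02 00:00:00' - timestamp '2009-01-01 00:00:00');  -- 1
SELECT date_part('hour', timestamp '2009-01-01 11:00:00' - timestamp '2009-01-01 10:00:00');  -- 1

-- Amazon Redshift: pass the date part and the two dates as comma-separated arguments
SELECT datediff(day,  '2009-01-01'::timestamp, '2009-01-02'::timestamp);                      -- 1
SELECT datediff(hour, '2009-01-01 10:00:00'::timestamp, '2009-01-01 11:00:00'::timestamp);    -- 1
```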
Now let’s look at how we use Amazon Redshift to calculate days and weeks in seconds.
The following Amazon Redshift query displays 2 days:
The following Amazon Redshift query displays 9 weeks:
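A hedged sketch of one way to express these, assuming the intent is to derive days and weeks from a difference computed in seconds (the date literals are illustrative):

```sql
-- 2 days: 172,800 seconds divided by 86,400 seconds per day
SELECT datediff(second, '2009-01-01'::timestamp, '2009-01-03'::timestamp) / 86400 AS days;

-- 9 weeks: 63 days expressed in seconds, divided by 604,800 seconds per week
SELECT datediff(second, '2009-01-01'::timestamp, '2009-03-05'::timestamp) / (86400 * 7) AS weeks;
```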
In Greenplum, the date subfields need to be in single quotes, whereas in Amazon Redshift we can use date subfields such as year, month, week, day, minute, and second without quotes. In Greenplum, we have to subtract the subfield of one date from the subfield of the other, whereas in Amazon Redshift we can pass the two dates as comma-separated arguments.
Extract ISOYEAR from date
ISOYEAR refers to the ISO 8601 week-numbering year. It begins with the Monday of the week containing January 4, so for dates in early January or late December, the ISO year may differ from the Gregorian year. An ISO year has 52 or 53 full weeks (364 or 371 days); the extra week is called a leap week, and a year containing one is called a leap year.
The following Greenplum query displays the ISOYEAR 2020:
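A minimal sketch using to_char with the ISO year format string; the date literal is an assumption (December 30, 2019 falls in ISO week 1 of 2020):

```sql
SELECT to_char(date '2019-12-30', 'IYYY');   -- returns 2020
```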
The following Amazon Redshift query displays the ISOYEAR 2020:
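A sketch of the same approach in Amazon Redshift, assuming the IYYY datetime format string is used with TO_CHAR:

```sql
SELECT to_char(date '2019-12-30', 'IYYY');   -- returns 2020
```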
The generate_series() function
Greenplum has adopted the PostgreSQL function generate_series(). However, generate_series works differently in Amazon Redshift when retrieving records from a table, because it’s a leader node-only function.
To display a series of numbers in Amazon Redshift, run the following query on the leader node. In this example, it displays 10 rows, numbered 1–10:
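For example:

```sql
-- Runs on the leader node because no table is referenced
SELECT generate_series(1, 10);
```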
To display a series of days for a given date, use the following query. It extracts the day from the given date and subtracts 1, to display a series of numbers from 0–6:
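A sketch assuming a date whose day component is 7 (the literal is illustrative):

```sql
SELECT generate_series(0, extract(day FROM date '2009-01-07')::int - 1);   -- returns 0 through 6
```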
But for queries that fetch records from a table, join with another table’s rows, or otherwise process data on the compute nodes, it doesn’t work and generates an Invalid Operation error. The following code is an example of a SQL statement that works for Greenplum but fails for Amazon Redshift:
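A hypothetical illustration; the customer table and its columns are assumptions:

```sql
-- Works in Greenplum; fails in Amazon Redshift because generate_series()
-- can't run on the compute nodes once a user table is involved
SELECT c.customer_id, s.n
FROM   customer c
CROSS JOIN (SELECT generate_series(1, 5) AS n) s;
```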
For Amazon Redshift, the solution is to create a table to store the series data, and rewrite the code as follows:
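A sketch of the rewrite, using a hypothetical series table:

```sql
-- Materialize the series once
CREATE TABLE series_numbers (n INT);
INSERT INTO series_numbers VALUES (1), (2), (3), (4), (5);

-- Join to the series table instead of calling generate_series()
SELECT c.customer_id, s.n
FROM   customer c
CROSS JOIN series_numbers s;
```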
Working with regular expressions (regex functions)
Amazon Redshift and Greenplum both support three conditions for pattern matching:
- LIKE
- SIMILAR TO
- POSIX operators
In this post, we don’t discuss all of these pattern matching methods in detail. Instead, we discuss a few regex functions and regex escape characters that aren’t supported by Amazon Redshift.
Regexp_split_to_table function
The regexp_split_to_table function splits a string using a POSIX regular expression pattern as the delimiter.
This function has the following syntax:
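```sql
-- PostgreSQL/Greenplum syntax; the optional flags argument modifies matching behavior
regexp_split_to_table(string, pattern [, flags])
```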
For Greenplum, we use the following query:
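For example (the string and delimiter are illustrative):

```sql
SELECT regexp_split_to_table('apple,banana,cherry', ',') AS value;
-- returns three rows: apple, banana, cherry
```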
For Amazon Redshift, the regexp_split_to_table function has to be converted using the Amazon Redshift split_part function:
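A sketch of one such conversion, assuming a helper table named numbers containing the integers 1, 2, 3, and so on:

```sql
SELECT split_part('apple,banana,cherry', ',', n.n) AS value
FROM   numbers n
WHERE  n.n <= regexp_count('apple,banana,cherry', ',') + 1;
```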
Another way to convert regexp_split_to_table is as follows:
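One possible sketch converts the delimited string into a JSON array and expands it with the JSON functions, again assuming the numbers helper table:

```sql
SELECT json_extract_array_element_text(
         '["' || replace('apple,banana,cherry', ',', '","') || '"]',
         n.n - 1) AS value
FROM   numbers n
WHERE  n.n <= json_array_length('["' || replace('apple,banana,cherry', ',', '","') || '"]');
```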
Substring from regex expressions
substring(string from pattern) extracts the substring or value matching the POSIX pattern that is passed to it. If there is no match, null is returned. For more information, refer to Pattern Matching.
We use the following code in Greenplum:
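A hypothetical example that extracts the first run of digits from a string:

```sql
SELECT substring('order 4521 confirmed' FROM '[0-9]+');   -- returns 4521
```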
We can use the regexp_substr function to convert this code to Amazon Redshift. It returns the characters extracted from a string by searching for a regular expression pattern. The syntax is as follows:
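```sql
REGEXP_SUBSTR(source_string, pattern [, position [, occurrence [, parameters]]])
```

For example, the earlier Greenplum sketch could be written as follows (the string is illustrative):

```sql
SELECT regexp_substr('order 4521 confirmed', '[0-9]+');   -- returns 4521
```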
Key points while converting regular expression escapes
The Postgres escape character E doesn’t work in Amazon Redshift. Additionally, the following Greenplum regular expression constraint escapes aren’t supported in Amazon Redshift:
- \m – Matches only at the beginning of a word
- \y – Matches only at the beginning or end of a word
For Amazon Redshift, use \\< and \\>, or [[:<:]] and [[:>:]] instead.
Use the following code for Greenplum:
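A hypothetical example (the sample_table table, col1 column, and search term are assumptions) that matches the word cat using the constraint escapes:

```sql
SELECT col1
FROM   sample_table
WHERE  col1 ~ E'\\mcat\\y';   -- 'cat' starting a word and ending at a word boundary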
Use the following code for Amazon Redshift:
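The same filter sketched with the \\< and \\> escapes:

```sql
SELECT col1
FROM   sample_table
WHERE  col1 ~ '\\<cat\\>';
```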
Or use the bracket-expression form:
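```sql
SELECT col1
FROM   sample_table
WHERE  col1 ~ '[[:<:]]cat[[:>:]]';   -- word-boundary bracket expressions
```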
Conclusion
For a heterogeneous database migration from Greenplum to Amazon Redshift, you can use AWS DMS and the AWS SCT to migrate most of the database objects, such as tables, views, stored procedures, and functions.
There are some situations in which a function used in the source environment isn’t supported in the target environment. In such cases, manual conversion is required to produce the same result set and complete the database migration.
In some cases, using a window function supported by the target environment proves more efficient for analytical queries that process petabytes of data.
This post covered several situations where manual code conversion is required; in many cases, the conversion also improves code efficiency and makes queries more performant.
If you have any questions or suggestions, please share your feedback.
About the Authors
Jagrit Shrestha is a Database Consultant at Amazon Web Services (AWS). He works as a database specialist, helping customers migrate their on-premises database workloads to AWS and providing technical guidance.
Ishwar Adhikary is a Database Consultant at Amazon Web Services (AWS). He works closely with customers to modernize their database and application infrastructures. His focus area is migration of relational databases from on-premises data centers to the AWS Cloud.
Shrenik Parekh works as a Database Consultant at Amazon Web Services (AWS). His expertise includes database migration assessment, database migration, and modernizing database environments with purpose-built databases using AWS Cloud database services. He is also focused on AWS services for data analytics. In his spare time, he loves hiking, yoga, and other outdoor activities.
Santhosh Meenhallimath is a Data Architect at AWS. He works on building analytical solutions, building data lakes, and migrating databases into AWS.