AWS Big Data Blog
Accelerate your data warehouse migration to Amazon Redshift – Part 7
Tens of thousands of customers use Amazon Redshift to gain business insights from their data. With Amazon Redshift, you can use standard SQL to query data across your data warehouse, operational data stores, and data lake. You can also integrate other AWS services such as Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all the analytic capabilities in AWS.
Migrating a data warehouse can be complex. You have to migrate terabytes or petabytes of data from your legacy system while not disrupting your production workload. You also need to ensure that the new target data warehouse is consistent with upstream data changes so that business reporting can continue uninterrupted when you cut over to the new platform.
Previously, there were two main strategies to maintain data consistency after the initial bulk load during a migration to Amazon Redshift. You could identify the changed rows, perhaps using a filter on update timestamps, or you could modify your extract, transform, and load (ETL) process to write to both the source and target databases. Both of these options require manual effort to implement and increase the cost and risk of the migration project.
The AWS Schema Conversion Tool (AWS SCT) can help you with the initial bulk load from Azure Synapse Analytics, BigQuery, Greenplum Database, IBM Netezza, Microsoft SQL Server, Oracle, Snowflake, Teradata, and Vertica. Now, we’re happy to share that AWS SCT can also maintain data consistency for you automatically. If you’re migrating from an IBM Netezza data warehouse to Amazon Redshift, the AWS SCT data extractors automatically capture changes from the source and apply them on the target. You configure a change data capture (CDC) migration task in AWS SCT, and it extracts the relevant data changes from IBM Netezza and applies them in a transactionally consistent order on Amazon Redshift. You only need to configure the required resources on IBM Netezza and start the data migration; the source database remains fully operational during the migration and replication.
In this post, we describe at a high level how CDC tasks work in AWS SCT. Then we deep dive into an example of how to configure, start, and manage a CDC migration task. We look briefly at performance and how you can tune a CDC migration, and then conclude with information about how to get started on your own migration.
Solution overview
The following diagram shows the data migration and replication workflow with AWS SCT.
In the first step, your AWS SCT data extraction agent completes the full load of your source data to Amazon Redshift. Then the agent uses the history database in Netezza to track changes. The history database captures information about user activity such as queries, query plans, table access, column access, session creation, and failed authentication requests. The data extraction agent extracts information about the transactions that run in your source Netezza database and replicates them to your target Redshift database.
You can start ongoing replication automatically after you complete the full load. Alternatively, you can start CDC at a later time or on a schedule. For more information, refer to Configuring ongoing data replication.
At a high level, the ongoing replication flow is as follows.
At the start of the replication, the data extraction agent captures the last transaction identifier in the history table and stores this value in the max_createxid variable. To capture the transaction ID, the agent queries the history database.
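A minimal sketch of such a lookup, assuming a history database named histdb owned by histuser; the history table name is also an assumption, and AWS SCT's actual query may differ:

```sql
-- Hedged sketch (not AWS SCT's exact SQL). CREATEXID is a hidden system
-- column that exists on every Netezza table, including the history tables;
-- it records the ID of the transaction that created each row.
SELECT MAX(createxid) AS max_createxid
FROM histdb.histuser."$hist_query_prolog_1";
```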
If this transaction ID value is different from the CDC start point, then the agent identifies the delta to replicate. This delta includes all transactions for the selected tables that happened after full load or after the previous replication. The data extraction agent selects the updated data from your source table.
From this updated data, AWS SCT creates two temporary tables. The first table includes all rows that you deleted from your source database and the old data of the rows that you updated. The second table includes all rows that you inserted and the new data of the rows that you updated. AWS SCT then uses these tables in JOIN clauses to replicate the changes to your target Redshift database.
Next, AWS SCT copies these tables to your Amazon Simple Storage Service (Amazon S3) bucket and uses this data to update your target Redshift cluster.
After updating your target database, AWS SCT deletes these temporary tables. Next, your data extraction agent sets the value of the CDC start point equal to the captured transaction ID (max_createxid). During the next data replication run, your agent determines the delta to replicate using this updated CDC start point.
All changes that happen to your source database during the replication run will be captured in the next replication run. Make sure that you repeat the replication steps until the delta is equal to zero for each table that you included in the migration scope. At this point, you can cut over to your new databases.
Configure your source database
In your source Netezza database, create a history database and configure the history logging. Next, grant read permissions for all tables in the history database to the user that you use in the AWS SCT project. This user has the minimal permissions that are required to convert your source database schemas to Amazon Redshift.
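Netezza history logging is set up with the nzhistcreatedb utility plus a history configuration. The following is a hedged sketch only; the database, user, and configuration names are illustrative, and you should confirm the exact options against the IBM Netezza documentation for your version:

```sql
-- Create the history database first with the nzhistcreatedb utility,
-- run from the Netezza host shell (illustrative invocation):
--   nzhistcreatedb -d histdb -t q -v 1 -u histuser -o histuser -p <password>

-- Define and activate a query-history configuration. The new configuration
-- takes effect the next time the Netezza system restarts.
CREATE HISTORY CONFIGURATION sct_hist HISTTYPE QUERY
    DATABASE histdb USER histuser PASSWORD '<password>'
    COLLECT QUERY, TABLE, COLUMN
    LOADINTERVAL 5 LOADMINTHRESHOLD 4 LOADMAXTHRESHOLD 20
    STORAGELIMIT 25 LOADRETRY 1 VERSION 1;
SET HISTORY CONFIGURATION sct_hist;

-- While connected to histdb, grant the AWS SCT project user read access:
GRANT SELECT ON TABLE TO sct_user;
```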
Configure AWS SCT
Before you start data migration and replication with AWS SCT, make sure that you download the Netezza and Amazon Redshift drivers. Take note of the path to the folder where you saved these files. You will specify this path in the AWS SCT and data extraction agent settings.
To make sure the data extraction agents work properly, install the latest version of Amazon Corretto 11.
To configure AWS SCT, complete the following steps:
- After you create an AWS SCT project, connect to your source and target databases and set up the mapping rules. A mapping rule describes a source-target pair that defines the migration target for your source database schema.
- Convert your database schemas and apply them to Amazon Redshift if you haven’t done this yet. Make sure that the target tables exist in your Redshift database before you start data migration.
- Now, install the data extraction agent. The AWS SCT installer includes the installation files for data extraction agents in the agents folder. Configure your data extraction agents by adding the listening port number and the path to the source and target database drivers. For the listening port, you can proceed with the default value. For database drivers, enter the path that you noted before.
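The agent reads its configuration from a settings file that the setup script populates with your answers. The following properties-style snippet is purely illustrative; the key names and values are assumptions, so rely on the prompts of the agent installer and the AWS SCT documentation for the real format:

```properties
# Hypothetical data extraction agent settings (all values are placeholders)
port=8192
vendor=NETEZZA
driver.jars=/opt/drivers/nzjdbc.jar
redshift.driver.jars=/opt/drivers/redshift-jdbc42.jar
working.folder=/data/sct
ssl.option=OFF
```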
The following diagram shows how the AWS SCT data extraction agents work.
After you install the data extraction agent, register it in AWS SCT.
- Open the data migration view in AWS SCT and choose Register.
- Enter the name of your agent, the host name, and the port that you configured in the previous step. For the host name, you can use localhost (0.0.0.0) if you run the agent on the same machine where AWS SCT is installed.
Create and run a CDC task
Now you can create and manage your data migration and replication tasks. To do so, complete the following steps:
- Select the tables in your source database to migrate, open the context (right-click) menu, and choose Create local task.
- Choose your data migration mode (for this post, choose Extract, upload and copy to replicate data changes from your source database):
- Extract only – Extract your data and save it to your local working folders.
- Extract and upload – Extract your data and upload it to Amazon S3.
- Extract, upload and copy – Extract your data, upload it to Amazon S3, and copy it into your Redshift data warehouse.
- Choose your encryption type. Make sure that you configure encryption for safe and secure data migrations.
- Select Enable CDC.
- After this, you can switch to the CDC settings tab.
- For CDC mode, you can choose from the following options:
- Migrate existing data and replicate ongoing changes – Migrate all existing source data and then start the replication. This is the default option.
- Replicate data changes only – Start data replication immediately.
Sometimes you don’t need to migrate all existing source data. For example, if you have already migrated your data, you can start the data replication by choosing Replicate data changes only.
- If you choose Replicate data changes only, you can also set the Last CDC point to configure the replication to start from this point. If it is not set, AWS SCT data extraction agents replicate all changes that occur after your replication task is started.
If your replication task fails, you can restart the replication from the point of failure. Find the identifier of the last migrated CDC point on the CDC processing details tab in AWS SCT, set Last CDC point, and start the task again. This allows the AWS SCT data extraction agents to replicate all changes in your source tables to your target database without gaps.
- You can also configure when you want to schedule the CDC runs to begin.
If you select Immediately, the first replication runs immediately after your agent completes the full load. Alternatively, you can specify the date and time when you want the replication to start.
- Also, you can schedule when to run the replication again. Enter the interval in days, hours, or minutes at which to repeat the replication runs. Set these values depending on how intensively data changes in your source database.
- Finally, you can set the end date when AWS SCT will stop running the replication.
- On the Amazon S3 settings tab, you can connect your AWS SCT data extraction agent with your Amazon S3 bucket.
You don’t need to do this step if you have configured the AWS service profile in the global application settings.
- After you have configured all settings, choose Create to create a CDC task.
- Start this task in the AWS SCT user interface.
The following screenshots show examples of the AWS SCT user interface after you start the tasks.
You can run multiple CDC tasks in parallel. For example, you can include different sets of source tables in each task to replicate the changes to different target Redshift clusters. AWS SCT handles these replication tasks and distributes resources correctly to minimize the replication time.
Data replication limitations
There are a few limitations in AWS SCT data replication:
- As of this writing, changes in your source database don’t automatically trigger a replication run, because AWS SCT can’t automate these runs. Instead, you run the data replication tasks on a predefined schedule.
- AWS SCT doesn’t replicate TRUNCATE and DDL statements. If you change the structure of your source table or truncate it, then you must run the same statements in your target database. You should make these changes manually because AWS SCT isn’t aware of structure updates.
End-to-end example
Now that you know how to create a local replication task in AWS SCT, we deep dive and show how AWS SCT performs the extract and load processes.
- First, we run a query to check that we correctly configured our source Netezza database. To use this example, change the history database name to your own.
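A hedged sketch of such a check, assuming a history database named histdb owned by histuser with a version 1 history configuration (Netezza suffixes history table names with the configuration version):

```sql
-- Count the collected history records; a value greater than zero indicates
-- that history logging is active. Schema and table names are assumptions.
SELECT COUNT(*)
FROM histdb.histuser."$hist_column_access_1";
```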
If you configured your database correctly, then the output of this command includes a value that is different from zero.
- Now we create a table on Netezza to use in the example. The table has three columns and a primary key.
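The rest of the walkthrough assumes a hypothetical test table named cdc_demo; the column names and types are illustrative:

```sql
CREATE TABLE cdc_demo (
    id     INTEGER NOT NULL,
    name   VARCHAR(50),
    amount NUMERIC(10,2),
    PRIMARY KEY (id)
);
```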
- A SELECT statement on the new table returns an empty result, as expected.
- Before we start the replication, we run a query on Netezza to get the latest transaction identifier recorded in the history table.
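Under the same assumptions as the sketch in the solution overview:

```sql
-- Newest transaction ID recorded in the history database (names are
-- assumptions; CREATEXID is a hidden system column on every Netezza table).
SELECT MAX(createxid) AS last_xid
FROM histdb.histuser."$hist_query_prolog_1";
```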
For our test table, the script prints the last transaction identifier, which is 2691798.
- To make sure that our table doesn’t include new transactions, AWS SCT runs a check script. If you want to run this script manually, replace 2691798 with the last transaction identifier in your history table.
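A sketch of such a check against our hypothetical cdc_demo table, using the hidden system columns explained below; this is an illustration, not AWS SCT's exact SQL:

```sql
-- Requires SET SHOW_DELETED_RECORDS = TRUE in the session so that logically
-- deleted rows remain visible (an assumption about the agent's session setup).
SELECT *
FROM cdc_demo
WHERE createxid > 2691798
   OR (deletexid <> 0 AND deletexid > 2691798);
```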
As expected, the script doesn’t return any values.
CREATEXID and DELETEXID are hidden system columns that exist in every Netezza table. CREATEXID identifies the transaction ID that created the row, and DELETEXID identifies the transaction ID that deleted the row. AWS SCT uses them to find changes in the source data.
Now we’re ready to start the replication.
- We assume you’ve used AWS SCT to convert the example table and build it on the target Amazon Redshift database, where AWS SCT runs a CREATE TABLE statement.
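For the hypothetical cdc_demo table, the converted definition might look like the following; the distribution style is an assumption, because AWS SCT chooses table properties based on your optimization settings:

```sql
CREATE TABLE cdc_demo (
    id     INTEGER NOT NULL,
    name   VARCHAR(50),
    amount NUMERIC(10,2),
    PRIMARY KEY (id)
)
DISTSTYLE AUTO;
```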
- AWS SCT also creates a staging table that holds replication changes until they can be applied to the actual target table.
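A sketch of what such a staging table could look like; the bookkeeping columns are assumptions about how the staged changes might be tagged:

```sql
-- Mirrors the target table and carries the transaction IDs that distinguish
-- new rows (deletexid = 0) from deleted or superseded rows (deletexid <> 0).
CREATE TABLE cdc_demo_staging (
    id        INTEGER,
    name      VARCHAR(50),
    amount    NUMERIC(10,2),
    createxid BIGINT,
    deletexid BIGINT
);
```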
- AWS SCT runs a query to capture all changes that happened after the last transaction identifier.
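Conceptually, the change-capture query looks like this sketch (an assumption built on the hidden system columns):

```sql
-- Requires SET SHOW_DELETED_RECORDS = TRUE, as before.
SELECT id, name, amount, createxid, deletexid
FROM cdc_demo
WHERE createxid > 2691798
   OR (deletexid <> 0 AND deletexid > 2691798);
```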
This script returns an empty table because we haven’t changed any data since the full load.
- Now, we change data on Netezza and see how it gets replicated to Amazon Redshift.
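As a hypothetical example, suppose we insert five rows:

```sql
INSERT INTO cdc_demo VALUES (1, 'Alpha',   10.00);
INSERT INTO cdc_demo VALUES (2, 'Beta',    20.00);
INSERT INTO cdc_demo VALUES (3, 'Gamma',   30.00);
INSERT INTO cdc_demo VALUES (4, 'Delta',   40.00);
INSERT INTO cdc_demo VALUES (5, 'Epsilon', 50.00);

SELECT * FROM cdc_demo ORDER BY id;
```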
The preceding script returns the five newly inserted rows.
- AWS SCT checks for data changes starting from the last transaction ID.
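A sketch of this check under our assumptions; it recomputes the newest transaction ID for the table:

```sql
SELECT MAX(xid) AS new_max_xid
FROM (
    SELECT MAX(createxid) AS xid FROM cdc_demo
    UNION ALL
    SELECT MAX(deletexid) AS xid FROM cdc_demo WHERE deletexid <> 0
) t;
```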
The script returns a result that is different from zero; in our case, the new transaction identifier is 2691824.
Because the new transaction ID is greater than the last transaction ID, the history table contains new data to be replicated.
- AWS SCT runs a query to extract the changes. The application detects all rows that were inserted, deleted, or updated within the scope of transactions with IDs in the range from 2691798 + 1 to 2691824.
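A sketch of the extraction under our assumptions:

```sql
-- Requires SET SHOW_DELETED_RECORDS = TRUE, as before.
SELECT id, name, amount, createxid, deletexid
FROM cdc_demo
WHERE (createxid > 2691798 AND createxid <= 2691824)
   OR (deletexid > 2691798 AND deletexid <= 2691824);
```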
With our hypothetical script, the extracted data consists of the five newly inserted rows.
- Next, AWS SCT compresses the data and uploads it to Amazon S3. Then AWS SCT runs a COPY command to load the data into the staging table on Amazon Redshift.
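A sketch of such a COPY command; the bucket path, file format, and AWS Identity and Access Management (IAM) role are placeholders:

```sql
COPY cdc_demo_staging
FROM 's3://your-bucket/sct/cdc_demo/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
GZIP;
```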
- From the staging table, AWS SCT applies the changes to the actual target table. For this iteration, we only insert new rows into the Redshift table.
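Under our staging assumptions, the apply step for pure inserts could look like this sketch:

```sql
BEGIN;

-- All staged rows are inserts in this iteration (deletexid = 0).
INSERT INTO cdc_demo (id, name, amount)
SELECT id, name, amount
FROM cdc_demo_staging
WHERE deletexid = 0;

-- Clear the staging table for the next replication run.
DELETE FROM cdc_demo_staging;

COMMIT;
```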
- Let’s run another script that not only inserts, but also deletes and updates data in the source table.
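A hypothetical script consistent with the observations that follow (row 7 is inserted and then deleted, row 4 is updated twice, and row 5 is updated and then deleted):

```sql
INSERT INTO cdc_demo VALUES (6, 'Zeta', 60.00);
INSERT INTO cdc_demo VALUES (7, 'Eta',  70.00);
DELETE FROM cdc_demo WHERE id = 7;               -- inserted, then deleted

UPDATE cdc_demo SET amount = 41.00 WHERE id = 4; -- intermediate version
UPDATE cdc_demo SET amount = 42.00 WHERE id = 4; -- most recent version

UPDATE cdc_demo SET amount = 51.00 WHERE id = 5;
DELETE FROM cdc_demo WHERE id = 5;               -- updated, then deleted
```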
With our hypothetical script, the Netezza table now contains rows 1, 2, 3, and 6, plus row 4 with its most recent values.
- AWS SCT detects the changes as before, using the new transaction ID range.
The extracted changes show how AWS SCT optimizes the delta:
Notice that we inserted a new row with ID 7 and then deleted this row. Therefore, we can ignore the row with ID 7 in our delta.
Also, we made several updates to the row with ID 4. In our delta, we include the original and the most recent versions of the row. We ignore all intermediate versions in our delta.
We updated the row with ID 5 and then deleted this row. We don’t include the updated row in our delta.
This way, AWS SCT optimizes the migrated data, reducing the migration time and the network traffic.
- Now, as before, AWS SCT compresses the data, uploads it to Amazon S3, and copies it into the staging table on Amazon Redshift.
- Then, AWS SCT applies the changes to the target table. AWS SCT removes the deleted rows and the old versions of updated rows, and then inserts the new rows and the most recent version of any updated rows.
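Under the same staging assumptions, a sketch of the apply step:

```sql
BEGIN;

-- Remove rows deleted at the source and old versions of updated rows
-- (staged with deletexid <> 0).
DELETE FROM cdc_demo
USING cdc_demo_staging s
WHERE cdc_demo.id = s.id
  AND s.deletexid <> 0;

-- Insert new rows and the most recent version of updated rows.
INSERT INTO cdc_demo (id, name, amount)
SELECT id, name, amount
FROM cdc_demo_staging
WHERE deletexid = 0;

DELETE FROM cdc_demo_staging;

COMMIT;
```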
- You can compare the data on the source and target to verify that AWS SCT captured all changes correctly.
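For example, run the same query on both systems and compare the output:

```sql
-- Run on Netezza and on Amazon Redshift; the result sets should be identical.
SELECT id, name, amount
FROM cdc_demo
ORDER BY id;
```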
The data on Amazon Redshift matches the source exactly.
In the previous examples, we showed how to run full load and CDC tasks. You can also create a CDC migration task without the full load. The process is the same—you provide AWS SCT with the transaction ID to start the replication from.
The CDC process doesn’t have a significant impact on the source side. AWS SCT runs only SELECT statements there, using transaction IDs as the boundaries of the WHERE clause. The performance impact of these statements is always smaller than the impact of the DML statements generated by your applications. For the machines where the AWS SCT data extraction agents run, the CDC-related workload is always smaller than the full load workload because the volume of transferred data is smaller.
On the target side, the CDC process can generate considerable additional workload for Amazon Redshift. This process issues INSERT and DELETE statements, which can create overhead on massively parallel processing (MPP) systems such as Amazon Redshift. Refer to Top 10 performance tuning techniques for Amazon Redshift for best practices and tips on how to boost the performance of your Redshift cluster.
Conclusion
In this post, we showed how to configure ongoing data replication for Netezza database migration to Amazon Redshift. You can use the described approach to automate data migration and replication from your IBM Netezza database to Amazon Redshift. Or, if you’re considering a migration of your existing Netezza workloads to Amazon Redshift, you can use AWS SCT to automatically convert your database schemas and migrate data. Download the latest version of AWS SCT and give it a try!
We’re happy to share these updates to help you in your data warehouse migration projects. In the meantime, you can learn more about Amazon Redshift and AWS SCT. Happy migrating!
About the Authors
Mykhailo Kondak is a Database Engineer on the AWS Database Migration Service team at AWS. He uses his experience with different database technologies to help Amazon customers move their on-premises data warehouses and big data workloads to the AWS Cloud. In his spare time, he plays soccer.
Illia Kravtsov is a Database Engineer on the AWS Database Migration Service team. He has over 10 years of experience in data warehouse development with Teradata and other massively parallel processing (MPP) databases.
Michael Soo is a Principal Database Engineer in the AWS Database Migration Service. He builds products and services that help customers migrate their database workloads to the AWS Cloud.