AWS Database Blog
Migration Validation (Part 2) – Introducing Data Validation in AWS Database Migration Service
AWS Database Migration Service (AWS DMS) helps you migrate databases to AWS quickly and securely. You can migrate your data to and from most widely used commercial and open source databases, such as Oracle, Microsoft SQL Server, and PostgreSQL. The service supports homogeneous migrations such as Oracle to Oracle, and also heterogeneous migrations between different database platforms, such as Oracle to PostgreSQL or MySQL to Oracle. In addition, it now provides a mechanism to validate the data after the migration.
This post gives you a quick overview of how you can create migration tasks that use the data validation feature. You can create these tasks by using the AWS DMS console.
The AWS DMS migration process consists of setting up a replication instance, source and target endpoints, and a replication task. The replication task runs on the replication instance and migrates data from the source endpoint to the target endpoint.
To create migration tasks, you can use the AWS Management Console or the AWS Command Line Interface (AWS CLI). If you are a first-time AWS CLI user, we recommend that you read the following documentation and get accustomed to configuring and using the AWS CLI:
- Create an AWS Identity and Access Management (IAM) user
- Set up required permissions for the IAM user
- Set up the roles required to use AWS DMS
- Set up the AWS CLI
Problem/use case
AWS DMS supports migrating homogeneous and heterogeneous databases. But many customers are looking for a way to validate the data after the migration. They are replicating their production databases. Before doing the cut over, they want to have high confidence in their migration by comparing the data between the source and target.
Customers use continuous replication to replicate data into different databases (for example, Oracle to PostgreSQL). Because it is their production system, they want to ensure that the data is being migrated without any loss or corruption.
Solution
Now AWS DMS provides a way to validate replicated data between two databases using a data validation feature.
To validate the replicated data, AWS DMS compares the data between the source and target to ensure that the data is the same. It makes appropriate queries to the source and the target to retrieve the data. If the data volume is large, AWS DMS also adopts a strategy to split the data into smaller manageable units and compare them. AWS DMS splits the table into multiple smaller groups of contiguous rows based on the primary key, which is known as a partition. It then compares the data at the partition level and exposes results of the final comparison to you. This partitioning approach helps AWS DMS compare a predefined amount of data at any point in time.
You can use the results of the comparison to determine whether a significant difference exists between the source and target. You can enable this data validation feature when you create the AWS DMS task on the AWS DMS console.
Creating an AWS DMS data validation task using the console
To enable the data validation feature using the AWS DMS console, when you create an AWS DMS task, select Enable validation under Task Settings.
On the Table statistics tab, you can check the result of the data validation during full load and change processing.
To understand the results of the data validation, see the next section.
Checking results of the data validation
The following metrics describe the results of data validation.
Validation state
- Validated: Indicates the records that are currently in sync between the source and target.
- Mismatched records: Indicates the number of unmatched records after applying all known updates to the target. A future update on the source might change a record from unmatched to matched.
- ValidationPendingRecords: Indicates how many records are yet to be validated.
- ValidationFailedRecords: Indicates how many records failed the validation, which means that the values are not matching between the source and target.
- ValidationSuspendedRecords: Indicates how many records are suspended. This can happen when rows are too busy and continue to be modified, so AWS DMS can’t compare them.
You can also enable data validation using AWS CLI commands. To do this, add the following validation settings to the task settings JSON:
After you start the AWS DMS task with data validation set using the AWS CLI, you can view the validation results at a table level using the AWS CLI DescribeTableStatistics
API call.
Failure details of the validation are stored on the target database in a table named aws_dms_validation_failures
. This table is similar to the use of the aws_dms_exceptions
table for storing exception details in applying the DML. Basically, this table stores the failure type, primary key value for a single failed record, or the start or end key values for a group of failed records.
Validation failure table definitions
The following table shows the structure of the aws_dms_validation_failures
table.
Column | Description |
TaskIdentifier | The identifier of the task. |
TableName | The name of the table. |
SchemaName | The name of the schema. |
RuleId | The identifier of the validation rule. |
StartKey | The primary key for row record type. If this is a range key, it is the start of the range. It is represented as a JSON string; the key in the JSON is the column name, and the value is the corresponding value. The JSON contains all columns in case of composite keys. If the primary key is a binary object, its value is represented as a base64-encoded string in the JSON. |
EndKey | Similar to the start key, except it is present only for the Range record type and is the end key of the range. |
RecordType | Can be either Row or Range. |
RecordCount | Represents the total number of records in the range. It doesn’t mean that they are not in sync or it can’t compare them; it just means that out of these records, some of them (we don’t know exactly how many of them) are out of sync or can’t be compared. For row records type, the value is always 1. |
Failure Type | Can be either OutofSync or CannotCompare. |
When you find mismatched records, it’s important to understand what caused it and resolve any issues. A mismatch can happen for several reasons:
- An update failing to apply to the target due to constraint violations, type conversion issues, etc.
- A direct update in the target
- Other unknown reasons
We can reload the table that reports OutOfSync
records. Generally, if OutOfSync
records are not caused by direct updates in the target or other known issues such as a constraint violation, you should engage the AWS Support team for further investigation.
Limitations
Note the following limitations when you are working with the AWS DMS data validation feature:
- Validation works only for tables with a primary key/unique index.
- If the target database is modified outside of AWS DMS, validation might fail to report the discrepancies accurately.
- If the same row or set of rows are modified continuously, validation can’t validate those busy rows, and you must check the busy rows themselves.
- You might not be able to compare all different types of data. For example, AWS DMS might not support comparing blob types as they are, or the comparison of some real/float numbers, etc.
- One of the current limitations with primary key constraints is that the length must be less than 1,024.
Summary
Data validation in AWS DMS is a great feature that can help you gain a high level of confidence in your migrations. It enables you to figure out how much of your data has successfully been replicated and validated. Data validation provides metrics, such as the total number of replicated rows, how many rows were compared, how many are in sync, how many are out of sync, and how many rows are not comparable because of inflight changes.
AWS DMS also provides a new migration assessment feature to help you validate your settings before running your migration tasks. For more information, see Migration Validation (Part 1) – Introducing Migration Assessment in AWS Database Migration Service.
Good luck, and happy migrating!
About the Author
Mahesh Kansara is a cloud support engineer at Amazon Web Services. He works with our customers to provide guidance and technical assistance on various database and analytical projects, helping them improving the value of their solutions when using AWS.