AWS Database Blog

AWS Database Migration Service improves migration speeds by adding support for parallel full load and new LOB migration mechanisms

We’re happy to introduce AWS DMS replication engine version 3.1.2. The new version includes a better user experience and performance improvements that many customers requested. We listened and delivered on our promise to make DMS better. In this blog post, I’ll cover a few key new features. For the entire list, see the AWS DMS release notes, which include detailed information about features and bug fixes for this and previous versions of DMS.

DMS replication engine version 3.1.2 includes the following new features:

  • UTF-8 4-byte character set support
  • Improved full load performance for partitioned and subpartitioned tables
  • Improved large object (LOB) performance
  • Table load order during full load
  • Support for the update of primary key values on a PostgreSQL source 

The summaries in this blog post include tests and examples that you can run on your own. To do this, you’ll need the following resources:

  • AWS Account
  • AWS Database Migration Service
  • Oracle database as a source
  • PostgreSQL database as a target

UTF-8 4-byte character set support

In previous versions of AWS DMS, UTF-8 4-byte character sets were not supported. Characters outside the Basic Multilingual Plane, such as U+1F338 🌸, U+10338 𐌸, and U+1010F 𐄏, would cause unexpected behavior during a migration. Migration tasks would fail with "invalid byte sequence" errors when 4-byte characters were encountered.

Here is an example of such an error. 

115345152: 2017-11-06T06:47:24 [TARGET_LOAD ]E: Failed to wait for previous run (csv_target.c:937) 
115345152: 2017-11-06T06:47:24 [TARGET_LOAD ]E: Failed to start load process for file '1201' [1020403] (csv_target.c:1300) 
115345152: 2017-11-06T06:47:24 [TARGET_LOAD ]E: Failed to load file '1201' [1020403] (csv_target.c:1377) 
115345152: 2017-11-06T06:47:24 [TARGET_LOAD ]D: Load command output: psql: /usr/lib64/libcom_err.so.2: no version information available (required by libgssapi_krb5.so.2) 
psql: /usr/lib64/libcom_err.so.2: no version information available (required by r2.3.0.R1/lib/libkrb5.so.3) 
ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xb1 0x80 
CONTEXT: COPY accesses, line 83299 (csv_target.c:895) 

With version 3.1.2, you can migrate 4-byte characters without any issue!

I’ll perform a quick test with DMS version 3.1.2 to show how you can migrate UTF-8 4-byte characters.

Run the following commands on the source Oracle database to create a table and insert records that contain 4-byte UTF-8 characters. (This assumes that the database character set is AL32UTF8, which can store 4-byte UTF-8 data in VARCHAR2 columns.)

CREATE TABLE UTF8MB4_Support (Id Int Primary Key, CharData Varchar2(1000));

INSERT INTO UTF8MB4_Support (Id, CharData) VALUES (1, 'This is a 4-byte character 🌸 U+1F338 CHERRY BLOSSOM');

INSERT INTO UTF8MB4_Support (Id, CharData) VALUES (2, 'This is a rose 🌹');

INSERT INTO UTF8MB4_Support (Id, CharData) VALUES (3, 'This is a sun flower 🌻');

INSERT INTO UTF8MB4_Support (Id, CharData) VALUES (4, 'This character is causing an issue 𐌸 end');

INSERT INTO UTF8MB4_Support (Id, CharData) VALUES (5, 'This character is causing an issue 𐄏 end');

COMMIT;

Create a DMS task to migrate the source table that contains the UTF-8 4-byte characters.
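
For reference, a minimal table mapping for this task might look like the following sketch. It assumes that the table was created in the HR schema, so adjust the schema name to match your environment. (Oracle stores unquoted identifiers in uppercase, which is why the table name appears as UTF8MB4_SUPPORT.)

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "HR",
        "table-name": "UTF8MB4_SUPPORT"
      },
      "rule-action": "include"
    }
  ]
}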

Run the following command to query the records on the target database table.

SELECT "ID", "CHARDATA"
  FROM “UTF8MB4_SUPPORT" order by "ID";

The query returns the records with the UTF-8 4-byte characters intact on the target database table.
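
Based on the rows inserted earlier, the output on the target should look similar to the following (shown here in psql-style format):

 ID |                       CHARDATA
----+------------------------------------------------------
  1 | This is a 4-byte character 🌸 U+1F338 CHERRY BLOSSOM
  2 | This is a rose 🌹
  3 | This is a sun flower 🌻
  4 | This character is causing an issue 𐌸 end
  5 | This character is causing an issue 𐄏 end
(5 rows)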

Improved full load performance for partitioned and subpartitioned tables

AWS DMS version 3.1.2 now supports parallel full load. During a full load migration task, you can accelerate the migration of large or partitioned tables by splitting the table into segments and loading the segments in parallel in the same migration task.

To use parallel loading, create a rule of type table-settings with the parallel-load option. Within the table-settings rule, specify the selection criteria for the table or tables that you want to load in parallel. To specify the selection criteria, set the type element for parallel-load to one of the following values. The AWS DMS documentation includes detailed information about these settings.

  • partitions-auto
  • subpartitions-auto
  • ranges

The task must be defined with a combination of the following source and target endpoints.

Source endpoints: Oracle, Microsoft SQL Server, MySQL, PostgreSQL, IBM Db2 for LUW, SAP Sybase ASE

Target endpoints: Oracle, Microsoft SQL Server, PostgreSQL, SAP Sybase ASE, MySQL, and Amazon Redshift

Example: setting up parallel load

I’ll use the example below to perform a parallel full load of a table during a migration. In this example, I create a list-partitioned table in Oracle and migrate it by using the partitions-auto parallel load method.

SQL> CREATE TABLE HR.SALES
   (
    SALES_NO NUMBER, 
    DEPT_NAME VARCHAR2 (20),
    SALE_AMOUNT NUMBER (10, 2),
    SALE_DATE DATE,
    REGION VARCHAR2 (10)
   )
   PARTITION BY LIST (REGION)
      (
       PARTITION P1 VALUES ('NORTH'),
       PARTITION P2 VALUES ('SOUTH'),
       PARTITION P3 VALUES ('EAST'),
       PARTITION P4 VALUES ('WEST')
     );

I’ll insert the example data into the source Oracle table by using the statements below.

SQL> insert into HR.SALES values (1,'SALES', 2000.25,'01-OCT-16','NORTH');
SQL> insert into HR.SALES values (2,'SALES', 6000.25,'01-OCT-16','SOUTH');
SQL> insert into HR.SALES values (3,'SALES', 8000.25,'01-OCT-16','EAST');
SQL> insert into HR.SALES values (4,'SALES', 12000.25,'01-OCT-16','WEST');
SQL> commit;
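
Before creating the task, you can confirm the partitions that DMS will load in parallel by querying the Oracle data dictionary. This is just a sanity check; it uses the standard ALL_TAB_PARTITIONS view.

SQL> SELECT PARTITION_NAME, HIGH_VALUE
       FROM ALL_TAB_PARTITIONS
      WHERE TABLE_OWNER = 'HR'
        AND TABLE_NAME = 'SALES';

The query should return the four partitions P1 through P4, one per region.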

Now I create a migration task with the table-mapping JSON below that includes the new table-settings rule.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "HR",
        "table-name": "SALES"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "table-settings",
      "rule-id": "2",
      "rule-name": "2",
      "object-locator": {
        "schema-name": "HR",
        "table-name": "SALES"
      },
      "parallel-load": {
        "type": "partitions-auto"
      }
    }
  ]
}

After the DMS task is complete, I can verify from the logs that the table segments were loaded in parallel.

[TASK_MANAGER   ] I:  Start loading segment #1 of 4 of table 'HR'.'SALES' (Id = 1) by subtask 1. Start load timestamp 000575692249E31C
[SOURCE_UNLOAD  ] I:  Unload finished for segment #1 of segmented table 'HR'.'SALES' (Id = 1). 1 rows sent.  
[TARGET_LOAD    ] I:  Load finished for segment #1 of segmented table 'HR'.'SALES' (Id = 1). 1 rows received. 0 rows skipped. Volume transfered 480. 
[TASK_MANAGER   ] I:  Load finished for segment #1 of table 'HR'.'SALES' (Id = 1) by subtask 1. 1 records transferred.  
[TASK_MANAGER   ] I:  Start loading segment #4 of 4 of table 'HR'.'SALES' (Id = 1) by subtask 1. Start load timestamp 00057569224B7585 
[SOURCE_UNLOAD  ] I:  Unload finished for segment #4 of segmented table 'HR'.'SALES' (Id = 1). 1 rows sent.  (streamcomponent.c:3401)
[TARGET_LOAD    ] I:  Load finished for segment #4 of segmented table 'HR'.'SALES' (Id = 1). 1 rows received. 0 rows skipped. Volume transferred 480.

If the source table is not partitioned, you can use the following configuration with the ranges (range segmentation) option.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "HR",
        "table-name": "SALES"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "table-settings",
      "rule-id": "2",
      "rule-name": "2",
      "object-locator": {
        "schema-name": "HR",
        "table-name": "SALES"
      },
      "parallel-load": {
        "type": "ranges",
        "columns": [
          "SALES_NO",
          "REGION"
        ],
        "boundaries": [
          [
            "1000",
            "NORTH"
          ],
          [
            "3000",
            "EAST"
          ]
        ]
      }
    }
  ]
}

The preceding configuration migrates the data in three segments (three threads in parallel), as follows.

  • Data with SALES_NO <= 1000 and REGION = 'NORTH'
  • Data with SALES_NO > 1000 and SALES_NO <= 3000 and REGION = 'EAST'
  • Data with SALES_NO > 3000
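
Conceptually, each segment corresponds to a query along the lines of the following sketch. This is simplified to the leading column only; the exact predicates that DMS generates from multi-column boundaries may differ.

-- Segment 1: rows up to and including the first boundary
SELECT * FROM HR.SALES WHERE SALES_NO <= 1000;
-- Segment 2: rows between the first and second boundaries
SELECT * FROM HR.SALES WHERE SALES_NO > 1000 AND SALES_NO <= 3000;
-- Segment 3: all rows beyond the last boundary
SELECT * FROM HR.SALES WHERE SALES_NO > 3000;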

Improved LOB performance

This new version of the DMS replication engine improves the loading of LOB data in a couple of ways. You can now specify different LOB optimization settings while migrating LOB data.

Per-table LOB settings

Using this feature, you can override task-level LOB settings for some or all of your tables. To do this, you define the lob-settings in your table-settings rule. Here is an example table that includes some large LOB values.

SET SERVEROUTPUT ON 
CREATE TABLE TEST_CLOB 
  ( 
     ID NUMBER, 
     C1  CLOB, 
     C2  VARCHAR2(4000) 
  ); 
DECLARE 
    bigtextstring CLOB := '123'; 
    i                   INT; 
BEGIN 
    WHILE Length(bigtextstring) <= 60000 LOOP 
        bigtextstring := bigtextstring || '000000000000000000000000000000000'; 
    END LOOP; 
INSERT INTO TEST_CLOB (ID, C1, C2) VALUES (0, bigtextstring,'AnyValue'); 

END; 
/ 
SELECT * FROM   TEST_CLOB; 
COMMIT;
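
You can confirm the size of the LOB you just created by using DBMS_LOB. Given the loop above, the length should be a little over 60,000 characters.

-- Check the CLOB length on the Oracle source
SELECT ID, DBMS_LOB.GETLENGTH(C1) AS C1_LENGTH
  FROM TEST_CLOB;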

Next, I create a migration task and modify the LOB handling for the table by using the new lob-settings rule. The bulk-max-size value sets the maximum LOB size in kilobytes; LOB data larger than this size is truncated.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "HR",
        "table-name": "TEST_CLOB"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "table-settings",
      "rule-id": "2",
      "rule-name": "2",
      "object-locator": {
        "schema-name": "HR",
        "table-name": "TEST_CLOB"
      },
      "lob-settings": {
        "mode": "limited",
        "bulk-max-size": 16
      }
    }
  ]
}

Even though this DMS task is created with FullLobMode: true, the per-table LOB settings direct DMS to truncate LOB data in this particular table to 16 KB (16,384 bytes). You can check the task logs to confirm this.

721331968: 2018-09-11T19:48:46:979532 [SOURCE_UNLOAD] W:  The value of column 'C' in table 'HR.TEST_CLOB' was truncated to length 16384  
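
You can also verify the truncation on the PostgreSQL target by comparing the stored length against the 16 KB limit. This sketch assumes that DMS created the target column as a text column and kept the same uppercase identifiers as the source.

SELECT "ID", LENGTH("C1") AS C1_LENGTH
  FROM "TEST_CLOB";

For this single-byte data, the reported length should be at most 16,384 characters, matching the warning in the log.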

Inline LOB settings

As you may know, while creating a DMS task, the LOB mode determines how LOBs are handled.

  • Full LOB mode – In full LOB mode, AWS DMS migrates all LOBs from source to target regardless of size. In this configuration, AWS DMS has no information about the maximum size of LOBs to expect. Thus, LOBs are migrated one at a time, piece by piece. Full LOB mode can be slow.
  • Limited LOB mode – In limited LOB mode, you set a maximum size LOB that AWS DMS should accept. Doing so allows AWS DMS to pre-allocate memory and load the LOB data in bulk. LOBs that exceed the maximum LOB size are truncated and a warning is recorded in the log file. In limited LOB mode, you get significant performance gains over full LOB mode.

Full LOB mode and limited LOB mode each have their own pros and cons. We listened to our customers and combined the advantages of both in the new inline LOB mode.

You can select this option when you need to replicate both small and large LOBs and most of the LOBs are small. When this option is selected, during full load the AWS DMS task transfers the small LOBs inline, which is more efficient, and transfers the large LOBs by performing a lookup from the source table.

During change processing, however, both small and large LOBs are replicated by performing a lookup from the source table.
When this option is selected, the AWS DMS task checks all of the LOB sizes to determine which ones to transfer inline. LOBs larger than the specified size are replicated by using full LOB mode. Therefore, if you know that most of the LOBs are larger than the specified setting, it is better not to use this option and to allow an unlimited LOB size instead.

You can configure this option by using a new task settings attribute, InlineLobMaxSize, which is available only when FullLobMode is set to true. The default value is 0. The range is 1 KB – 2 GB.

For example, you can use the following AWS DMS task settings. With InlineLobMaxSize set to a value of 5, all LOBs smaller than 5 KB are transferred inline during full load.

{
  "TargetMetadata": {
    "TargetSchema": "",
    "SupportLobs": true,
    "FullLobMode": true,
    "LobChunkSize": 64,
    "LimitedSizeLobMode": false,
    "LobMaxSize": 0,
    "InlineLobMaxSize": 5,
    "LoadMaxFileSize": 0,
    "ParallelLoadThreads": 0,
    "ParallelLoadBufferSize": 0,
    "BatchApplyEnabled": false,
    "TaskRecoveryTableEnabled": false
  },
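
These settings go in the task settings JSON that you supply when creating or modifying a task. For example, with the AWS CLI you might apply them as follows. This is a sketch only; substitute your own task ARN and settings file.

aws dms modify-replication-task \
    --replication-task-arn arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK \
    --replication-task-settings file://task-settings.json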

Table load order during full load

Now, with DMS version 3.1.2, you can control the order in which tables are loaded. This can be useful in some situations. For example, if your selected tables list contains tables of different sizes, you might want certain tables to begin loading before the others.

This is supported through a new selection rule attribute called load-order. It’s allowed only for the rule-action include, and you specify a non-negative number. Zero is the default, and the maximum load order value is 2147483647. Tables with higher load-order values begin loading before tables with lower load-order values.

For example, if you want to load two tables named BIG_Table and SMALL_Table in a particular order, you can control this by using the load-order attributes in the task JSON shown below. In this example, the table BIG_Table with load order 2 begins loading before the table SMALL_Table with load order 1.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "HR",
        "table-name": "BIG_Table"
      },
      "rule-action": "include",
      "load-order": 2
    },
    {
      "rule-type": "selection",
      "rule-id": "3",
      "rule-name": "3",
      "object-locator": {
        "schema-name": "HR",
        "table-name": "SMALL_Table"
      },
      "rule-action": "include",
      "load-order": 1
    }
  ]
}

You can confirm this behavior by looking at the AWS DMS task logs, as shown below.

209925094: 2018-09-13T06:53:00 [TABLES_MANAGER  ]I:  Next table to load 'HR'.'BIG_Table' ID = 1, order = 2  (tasktablesmanager.c:1661)
209925094: 2018-09-13T06:53:00 [TASK_MANAGER    ]I:  Start loading table 'HR'.'BIG_Table' (Id = 1) by subtask 1. Start load timestamp 000575BB28605C42  (replicationtask_util.c:707)
-18245163: 2018-09-13T06:53:00 [TARGET_LOAD     ]I:  Target endpoint 'PostgreSQL' is using provider syntax 'PostgreSQLLinux'  
-18245163: 2018-09-13T06:53:00 [TARGET_LOAD     ]I:  postgres_set_encoding(...) Main encoding details: Name='UTF8', Language='all', isServerEncoding?='true' storageFactor=4, ConverterName='UTF-8'  
-18245163: 2018-09-13T06:53:00 [TARGET_LOAD     ]I:  Transaction size is 10000. Array size is 1000.  
-18245163: 2018-09-13T06:53:00 [SOURCE_CAPTURE  ]I:  Source endpoint 'Oracle' is using provider syntax 'Oracle'  (provider_syntax_manager.c:610)
209925094: 2018-09-13T06:53:00 [TABLES_MANAGER  ]I:  Next table to load 'HR'.'SMALL_Table' ID = 2, order = 1  
209925094: 2018-09-13T06:53:00 [TASK_MANAGER    ]I:  Start loading table 'HR'.'SMALL_Table' (Id = 2) by subtask 2.

When a group of tables is set with the same load order, AWS DMS loads the tables in alphabetical order.

Support for the update of primary key values on a PostgreSQL source

Previous versions of DMS ignored updates to a primary key when working with PostgreSQL as a source. For example, if you run the following statements on a PostgreSQL source with a full load and ongoing replication task, you might see the behavior shown here.

Run the following commands to create a table and insert a few records into the source PostgreSQL database.

CREATE TABLE Products (
    product_no integer primary key,
    name text,
    price numeric
);

INSERT INTO Products VALUES (1, 'Cheese', 1.99);
INSERT INTO Products VALUES (2, 'Mango', 2.99);

COMMIT;

Run the following command to check the inserted records on the source PostgreSQL database.

SELECT * FROM Products;

Create an AWS DMS task to migrate the existing data and replicate ongoing changes for the preceding table.

Run the following command to update the primary key column on the source table.

UPDATE Products
SET product_no=3
WHERE price=1.99;

COMMIT;

The following snippet of a DMS task log shows the errors. The task ignores the update to the primary key column and writes the failed statement to the DMS control table awsdms_apply_exceptions on the target database.

[TARGET_APPLY    ]W:  Some changes from the source database had no impact when applied to the target database. See 'awsdms_apply_exceptions' table for details.  (endpointshell.c:4999)
188324838: 2018-10-04T03:47:48:993837 [TARGET_APPLY    ]D:  Failed to execute statement. Error is: 0 rows affected
UPDATE  "public"."products" SET "product_no"='3' , "name"='Cheese' , "price"='1.990000' WHERE "product_no"='3'  (endpointshell.c:5003)
[TARGET_APPLY    ]T:  Event received: operation 'COMMIT (7)' event id '9' table id '0' context '00000120/96000248'  (streamcomponent.c:2569)
 [TARGET_APPLY    ]T:  Enter odbc_endpoint_commit_transaction  (odbc_endpoint_imp.c:5786)

Now, with DMS version 3.1.2, you can update primary key values for PostgreSQL as a source. The changes are replicated to the target database server.
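
To confirm the new behavior, query the target table after the update is replicated. Based on the statements above, the row that previously had product_no 1 should now appear with product_no 3, producing output similar to the following:

SELECT * FROM products ORDER BY product_no;

 product_no |  name  | price
------------+--------+-------
          2 | Mango  |  2.99
          3 | Cheese |  1.99
(2 rows)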

Conclusion

With AWS DMS replication engine version 3.1.2, you can use all of the exciting features described above. You can migrate 4-byte UTF-8 characters from supported source databases, use parallel full load to speed up migrations, and use optimized LOB settings when dealing with LOBs of different sizes. You can also specify the table load order during migration and update primary key values for PostgreSQL source databases.

If you have questions or suggestions, add your comments below. Good luck and happy migrating!


About the Author

Mahesh Kansara is a database engineer at Amazon Web Services. He works with our customers to provide guidance and technical assistance on various database and analytical projects, helping them improve the value of their solutions when using AWS.