Why are the source filter conditions in my AWS DMS task not working?

Last updated: 2022-07-29

I have configured source filter conditions on my AWS Database Migration Service (AWS DMS) task, but they aren't working correctly. How can I troubleshoot issues with source filter conditions on my AWS DMS task?

Short description

There are several reasons why source filter conditions might not be working as expected on an AWS DMS task. For example, the engine that you're using might not support the use of source filter conditions. Or, you might be affected by one or more of the limitations of the feature.

Resolution

Check if your engine supports the source filtering feature

Although most AWS DMS sources support source filtering, some sources such as MongoDB or Amazon DocumentDB don't support this feature. Check the AWS DMS documentation about Sources for data migration to confirm whether there are limitations to the source engine that you're using.

Review the AWS DMS source filtering limitations

There are several limitations associated with source filtering. Review the limitations to check that you're using the feature correctly:

  • Filters don't calculate columns of right-to-left languages.
  • Filters can't be applied to large object (LOB) columns.
  • Filters can be applied only to immutable columns that aren't updated after creation. If you apply source filters to mutable columns that can be updated after creation, then the source filters might not work as intended.

Troubleshoot filters that stop working during full load

If your issue persists, identify the phase during which source filtering stops working.

If filtering doesn't work during full load, follow these steps:

1.    Confirm that any case sensitivity in the mapping rules matches the source engine. For example, object names in Oracle and Db2 are uppercase by default. Similarly, object names in PostgreSQL are lowercase. Make sure that the column that you use in your filtering condition matches any case sensitivity required by your source engine.

2.    When you filter on date data types, check that you're using the formats that AWS DMS requires. For example, AWS DMS uses the date format YYYY-MM-DD and the datetime format YYYY-MM-DD HH:MM:SS for filtering.

3.    Reproduce the issue with the debug logging level turned on for SOURCE_UNLOAD. Then, capture the query that AWS DMS runs on the source to unload the data. The following is an example of a filtering problem on a source Oracle table.

Table details:

CREATE TABLE DMS.FILTERS
( ID NUMBER(10) NOT NULL,
  ENTRY_DATE DATE,
  CONSTRAINT FILTERS_PK PRIMARY KEY (ID)
);

SQL> SELECT * FROM FILTERS;
  ID       ENTRY_DATE
---------- ---------
         1 01-JAN-22
         2 01-JUN-22
         3 01-JAN-21
         4 01-JUN-21
         5 01-JAN-20
         6 01-JUN-20

An AWS DMS task is created with the mapping rules to replicate only rows with ENTRY_DATE greater than or equal to 01/01/2022:

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "893662253",
      "rule-name": "893662253",
      "object-locator": {
        "schema-name": "DMS",
        "table-name": "FILTERS"
      },
      "rule-action": "include",
      "filters": [
        {
          "filter-type": "source",
          "column-name": "ENTRY_DATE",
          "filter-conditions": [
            {
              "filter-operator": "gte",
              "value": "01/01/2022"
            }
          ]
        }
      ]
    }
  ]
}

As a result, no records are replicated, and the task log shows this error:

01786264: 2022-06-22T10:36:53 [SOURCE_UNLOAD   ]E:  ORA-01843: not a valid month  [1020417]  (oracle_endpoint_unload.c:171)

Because debug logs are turned on for SOURCE_UNLOAD, the task logs show the exact query that AWS DMS runs on the source database:

1786264: 2022-06-22T10:36:53 [SOURCE_UNLOAD   ]D:  Select statement for UNLOAD is 'SELECT "ID","ENTRY_DATE"  FROM "DMS"."FILTERS" WHERE ((("ENTRY_DATE" >= TO_DATE('0000-00-00','YYYY-MM-DD'))))'  (oracle_endpoint_utils.c:1979)

In the log output, you can see that AWS DMS runs this query on the source database:

SELECT "ID","ENTRY_DATE"  FROM "DMS"."FILTERS" WHERE ((("ENTRY_DATE" >= TO_DATE('0000-00-00','YYYY-MM-DD'))));

AWS DMS doesn't recognize the date provided in the mapping rules because it doesn't match the date format that AWS DMS expects. So, modify the mapping rules to match the expected date format:

{
  "filter-operator": "gte",
  "value": "2022-01-01"
}
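
To capture the unload query as shown in this example, turn on debug logging for the SOURCE_UNLOAD component in the task settings. The following is a sketch of the relevant fragment of the task settings JSON; the component ID and severity names follow the AWS DMS task-logging settings, so verify them against the current AWS DMS documentation before use:

{
  "Logging": {
    "EnableLogging": true,
    "LogComponents": [
      {
        "Id": "SOURCE_UNLOAD",
        "Severity": "LOGGER_SEVERITY_DETAILED_DEBUG"
      }
    ]
  }
}

After you reproduce the issue, return the severity to its default level, because detailed debug logging increases log volume.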

Troubleshoot filters that stop working during CDC

Note: Apply filters only to immutable columns that aren't updated after creation. Adding filters on changeable columns can cause unexpected results.

If the columns that you use in filtering are immutable, you might notice that filtering issues happen only during the CDC phase. These issues might also occur only for specific DML statements, such as UPDATEs or DELETEs. In this scenario, make sure that you have turned on sufficient logging on the source table. Follow these steps to configure additional logging, depending on the engine that you're using:

Oracle

Oracle uses supplemental logging to add additional logs on table columns. If the column that you're filtering isn't a primary key column, make sure that supplemental logging is activated for that column and primary key columns.

This example replicates a table named TEST.LOGGING with a primary key ID and a filter on the column NAME. Run a command similar to the following to create a supplemental log group:

ALTER TABLE TEST.LOGGING ADD SUPPLEMENTAL LOG GROUP TEST_LOG_GROUP (ID, NAME) ALWAYS;

If supplemental logging is already added for all columns in a table, as in the following example, then you don't need to add more logging:

ALTER TABLE TableName ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
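
To confirm what supplemental logging already exists on a table, you can query the Oracle data dictionary. The following is a sketch that uses the DBA_LOG_GROUPS and DBA_LOG_GROUP_COLUMNS views (the ALL_ variants work similarly if you don't have DBA privileges), with the TEST.LOGGING table from the example above:

SELECT LOG_GROUP_NAME, LOG_GROUP_TYPE, ALWAYS
FROM   DBA_LOG_GROUPS
WHERE  OWNER = 'TEST' AND TABLE_NAME = 'LOGGING';

SELECT LOG_GROUP_NAME, COLUMN_NAME
FROM   DBA_LOG_GROUP_COLUMNS
WHERE  OWNER = 'TEST' AND TABLE_NAME = 'LOGGING';

The first query shows whether the log group is unconditional (ALWAYS), and the second shows which columns it covers.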

PostgreSQL

PostgreSQL uses the REPLICA IDENTITY property to configure the logging level for a table. When REPLICA IDENTITY is set to DEFAULT, PostgreSQL records the old values of the primary key columns in the WAL logs. But the default logging level might not be sufficient for DELETEs when you use a non-primary key column in filtering. In this scenario, perform the following steps, depending on the plugin that the AWS DMS task is using.

If test_decoding is used:

Set REPLICA IDENTITY to full on the table. If you don't complete this step, then AWS DMS might send all of the DELETEs to the target table:

ALTER TABLE tablename REPLICA IDENTITY FULL;

If pglogical is used:

Set REPLICA IDENTITY to full after the table is added to the replication set. This is because pglogical has a limitation that prevents the table from being added to a replication set if REPLICA IDENTITY is full.

ALTER TABLE tablename REPLICA IDENTITY FULL;

Note: Setting REPLICA IDENTITY to full on a table increases the number of WAL logs that are generated on the source database. This happens because all columns are then logged in WAL.
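
To check the current REPLICA IDENTITY setting for a table, query the pg_class catalog. The relreplident column stores 'd' for default and 'f' for full:

SELECT relname, relreplident
FROM pg_class
WHERE relname = 'tablename';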

SQL Server

If you are using SQL Server, check that all of the AWS DMS CDC prerequisites are met, specifically these logging requirements:

  • For every table with a primary key, run this query to turn on MS-CDC:

    Note: If MS-Replication is already used for on-premises sources, then skip this step. This step is required for cloud-based sources.

exec sys.sp_cdc_enable_table
@source_schema = N'schema_name',
@source_name = N'table_name',
@role_name = NULL,
@supports_net_changes = 1
GO
  • Run this query to turn on MS-CDC for every table with unique keys but no primary key:

    Note: This step is required for both on-premises and cloud-based sources.

exec sys.sp_cdc_enable_table
@source_schema = N'schema_name',
@source_name = N'table_name',
@index_name = N'unique_index_name',
@role_name = NULL,
@supports_net_changes = 1
GO
  • Run this query to turn on MS-CDC for every table with no primary key or unique keys:

    Note: This step is required for both on-premises and cloud-based sources.

exec sys.sp_cdc_enable_table
@source_schema = N'schema_name',
@source_name = N'table_name',
@role_name = NULL
GO
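
To verify which tables already have MS-CDC turned on, you can query the source database. The following sketch uses the is_tracked_by_cdc flag in sys.tables:

SELECT s.name AS schema_name, t.name AS table_name
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE t.is_tracked_by_cdc = 1;

Any table that you're filtering on and that doesn't appear in this result still needs one of the sp_cdc_enable_table commands above.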

MySQL

In MySQL, row images in the binlogs are controlled by the binlog_row_image system variable. AWS DMS requires that binlog_row_image is set to FULL and binlog_format is set to ROW, so that MySQL logs all columns in both the before image and the after image. Make sure that binlog_row_image is set to FULL on the source database to confirm the maximum logging level in the binlogs.
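
To confirm the current binlog settings on the source MySQL database, run:

SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';

Note: On Amazon RDS for MySQL, you change these variables through the DB parameter group associated with the instance rather than with SET GLOBAL.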