AWS Database Blog

How to use segmentation to improve performance for large MongoDB and Amazon DocumentDB migrations in AWS DMS

When you’re migrating from a MongoDB or Amazon DocumentDB (with MongoDB compatibility) database using AWS Database Migration Service (AWS DMS) in full load mode, a primary consideration is how quickly you can migrate the data. By default, when you provision a replication instance in AWS DMS, it uses a single-threaded process to migrate data. Migrating a large database with a single thread can be slow, and slow performance can affect your entire migration experience.

We’re excited to announce a new feature of AWS DMS that automatically performs a segmented (multi-threaded) unload from a MongoDB or Amazon DocumentDB collection to any supported target for a full load migration. This capability complements the existing range segmentation feature, which also performs a segmented unload but relies on range boundaries that you provide. As with range segmentation, auto segmentation could improve migration performance by up to three times.

In this post, we show you how to use auto segmentation and range segmentation to migrate data from MongoDB and Amazon DocumentDB source endpoints in AWS DMS.

Prerequisites

You should have a basic understanding of how AWS DMS works. If you’re just getting started with DMS, review the AWS DMS documentation. You should also have a MongoDB or an Amazon DocumentDB source cluster and a supported AWS DMS target to perform a migration.

Using Auto Segmentation with AWS DMS

AWS DMS auto segmentation for MongoDB and Amazon DocumentDB allows you to load data in parallel during the full load phase of the migration based on the segmentation parameters you provide. You can achieve better performance with auto segmentation when the document sizes in the collection are uniformly distributed and aren’t skewed. When the dataset is skewed and the size of the documents varies greatly across the collection, it might be more beneficial to use range segmentation.

With auto segmentation, you can provision multiple threads, each of which is responsible for transferring a chunk of data from the source collection to the target. AWS DMS computes the scope of each segment (a set of documents defined by ObjectId) by determining the segment’s lower boundary. To do so, AWS DMS collects the total number of documents in the collection, sorts the collection by ObjectId, and performs paginated skips to determine the range boundaries. You can specify the maximum number of documents to skip each time during pagination.

Consider the following MongoDB collection, SALES in the HR database, with 15 documents and an ObjectId in the _id field.

Example Source MongoDB Collection

The following JSON block is an example parallel load rule in the AWS DMS table mappings when creating a full load replication task to enable auto segmentation:

"parallel-load": {
    "type": "partitions-auto",
    "number-of-partitions": 3,
    "collection-count-from-metadata": "true",
    "max-records-skip-per-page": 1000000,
    "batch-size": 50000
}

The table mapping has the following parameters:

  • type – Setting type to partitions-auto is a required parameter to enable auto segmentation.
  • number-of-partitions – The total number of partitions (segments) that AWS DMS uses for the migration. This is an optional parameter with a default value of 16, a minimum value of 1, and a maximum value of 49. The MaxFullLoadSubTasks value in the task settings must be greater than or equal to the number-of-partitions value that you specify. For more information, see Full-load task settings.
  • collection-count-from-metadata – When enabled, AWS DMS uses collection.estimatedDocumentCount() to fetch the estimated number of documents in the collection from the collection metadata. Setting this to false uses db.collection.count() to return the number of documents that match an empty query predicate. The document count is used to automatically segment the load across multiple threads. This parameter is optional with a default value of true.
  • max-records-skip-per-page – The maximum number of documents to skip at a time during segmentation. For each segment, the collection is sorted and paginated skips are iteratively applied to find the lower range boundary for the segment. This parameter is optional with a default value of 10000, a minimum value of 10, and a maximum value of 5000000. A higher skip value is generally recommended for better performance, and a lower skip value is useful when the source latency is high or in the event of cursor timeouts.
  • batch-size – The maximum number of documents that AWS DMS retrieves in each batch of the response from the database. By default, the batch size value is 0, which means that AWS DMS uses the maximum batch size defined on the source MongoDB or Amazon DocumentDB server to retrieve the documents. This parameter is optional.
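As a rough illustration of the defaults and limits listed above, the following Python sketch builds a partitions-auto rule and rejects values outside the documented ranges. The function build_auto_parallel_load and its argument names are hypothetical and not part of AWS DMS; only the JSON keys, defaults, and limits come from the parameter list.

```python
import json


def build_auto_parallel_load(max_full_load_sub_tasks,
                             number_of_partitions=16,
                             collection_count_from_metadata=True,
                             max_records_skip_per_page=10000,
                             batch_size=0):
    """Return a "parallel-load" dict for auto segmentation (hypothetical helper)."""
    # Limits taken from the parameter descriptions in this post.
    if not 1 <= number_of_partitions <= 49:
        raise ValueError("number-of-partitions must be between 1 and 49")
    if not 10 <= max_records_skip_per_page <= 5000000:
        raise ValueError("max-records-skip-per-page must be between 10 and 5000000")
    # MaxFullLoadSubTasks in the task settings must cover every partition.
    if max_full_load_sub_tasks < number_of_partitions:
        raise ValueError("MaxFullLoadSubTasks must be >= number-of-partitions")
    return {
        "type": "partitions-auto",
        "number-of-partitions": number_of_partitions,
        # The examples in this post pass the flag as a string.
        "collection-count-from-metadata": "true" if collection_count_from_metadata else "false",
        "max-records-skip-per-page": max_records_skip_per_page,
        "batch-size": batch_size,
    }


rule = build_auto_parallel_load(max_full_load_sub_tasks=8,
                                number_of_partitions=3,
                                max_records_skip_per_page=100,
                                batch_size=5000)
print(json.dumps({"parallel-load": rule}, indent=4))
```

Checking the values before you create the task can surface a mismatch between number-of-partitions and MaxFullLoadSubTasks early, instead of at task start.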

The following is an example table mapping for migrating the preceding SALES collection in the HR database:

{
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "HR",
                "table-name": "SALES"
            },
            "rule-action": "include",
            "filters": []
        },
        {
            "rule-type": "table-settings",
            "rule-id": "2",
            "rule-name": "2",
            "object-locator": {
                "schema-name": "HR",
                "table-name": "SALES"
            },
            "parallel-load": {
                "type": "partitions-auto",
                "number-of-partitions": 3,
                "collection-count-from-metadata": "true",
                "max-records-skip-per-page": 100,
                "batch-size": 5000
            }
        }
    ]
}

Because the number of partitions provided is 3, three segment threads are provisioned for the source, each of which unloads five documents from the SALES collection. Because max-records-skip-per-page is 100, which is greater than the number of documents that need to be skipped for any segment, AWS DMS skips 0, 5, and 10 documents for the first, second, and third segments, respectively, to get each segment’s lower boundary.

The preceding configuration migrates the example data in three segments (three threads in parallel):

  • The first segment migrates five documents from the SALES collection with _id >= 610d86b7cbdda740c67de327
  • The second segment migrates the next five documents, with _id >= 610d86b7cbdda740c67de32c
  • The third segment migrates the last five documents, with _id >= 610d86b7cbdda740c67de331
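To make the skip arithmetic concrete, here is a minimal Python sketch that simulates the boundary computation on an in-memory list instead of a live collection. It assumes the 15 documents have consecutive ObjectId values starting at 610d86b7cbdda740c67de327 (which is consistent with the three boundaries listed above) and that every segment’s skip starts from the lowest _id. The function names are hypothetical, and this is not the actual AWS DMS implementation.

```python
from bisect import bisect_left


def skip_from(sorted_ids, start_id, skip):
    """Emulate: find _id >= start_id, sort by _id, skip `skip` documents, take one."""
    pos = bisect_left(sorted_ids, start_id) + skip
    return sorted_ids[pos] if pos < len(sorted_ids) else None


def lower_boundaries(sorted_ids, number_of_partitions, max_skip):
    """Return the lower _id boundary of each segment using paginated skips."""
    per_segment = len(sorted_ids) // number_of_partitions
    boundaries = []
    for segment in range(number_of_partitions):
        remaining = segment * per_segment  # documents to skip: 0, 5, 10 in the example
        current = sorted_ids[0]            # assumption: start from the lowest _id
        while remaining > 0:
            step = min(remaining, max_skip)  # never skip more than max-records-skip-per-page
            current = skip_from(sorted_ids, current, step)
            remaining -= step
        boundaries.append(current)
    return boundaries


# Assumed sample data: 15 consecutive 24-character hex ObjectId strings.
first = int("610d86b7cbdda740c67de327", 16)
ids = [format(first + i, "024x") for i in range(15)]

boundaries = lower_boundaries(ids, number_of_partitions=3, max_skip=100)
for boundary in boundaries:
    print(boundary)
```

Under these assumptions the sketch returns the same three lower boundaries as the list above. A smaller max_skip value gives the same boundaries but needs more skip steps to reach them.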

After specifying the table mappings, create and start a migration task to migrate the data to the target. For more information, see Creating a task.
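If you script task creation instead of using the console, the request could be assembled along the following lines. This is a sketch under assumptions: the helper replication_task_request, the task identifier, and the ARNs are hypothetical placeholders, and the abbreviated table mapping stands in for the full SALES example above. The dictionary keys are the parameter names that the AWS DMS CreateReplicationTask API accepts.

```python
import json


def replication_task_request(task_id, source_arn, target_arn, instance_arn,
                             table_mappings, max_full_load_sub_tasks):
    """Build keyword arguments for a full load task (hypothetical helper)."""
    # MaxFullLoadSubTasks must be >= number-of-partitions in the table mapping.
    settings = {"FullLoadSettings": {"MaxFullLoadSubTasks": max_full_load_sub_tasks}}
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "full-load",
        # Table mappings and task settings are passed as JSON strings.
        "TableMappings": json.dumps(table_mappings),
        "ReplicationTaskSettings": json.dumps(settings),
    }


locator = {"schema-name": "HR", "table-name": "SALES"}
table_mappings = {
    "rules": [
        {"rule-type": "selection", "rule-id": "1", "rule-name": "1",
         "object-locator": locator, "rule-action": "include", "filters": []},
        {"rule-type": "table-settings", "rule-id": "2", "rule-name": "2",
         "object-locator": locator,
         "parallel-load": {"type": "partitions-auto", "number-of-partitions": 3}},
    ]
}

# Placeholder ARNs; replace them with the ARNs of your own resources.
request = replication_task_request(
    task_id="hr-sales-auto-segmentation",
    source_arn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE-PLACEHOLDER",
    target_arn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET-PLACEHOLDER",
    instance_arn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE-PLACEHOLDER",
    table_mappings=table_mappings,
    max_full_load_sub_tasks=8,
)
print(request["ReplicationTaskSettings"])
```

With the AWS SDK for Python (Boto3) installed and credentials configured, the dictionary can be passed as keyword arguments to the DMS client’s create_replication_task method, and the task can then be started with start_replication_task. Neither call is made in this sketch.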

Limitations of auto segmentation

Auto segmentation has the following limitations:

  • Because AWS DMS has to compute the segment boundaries by sorting on the primary key _id and paginating through the collection, auto segmentation carries some overhead.
  • Because AWS DMS uses the minimum _id of each segment as its boundary and as the starting point for pagination, changing the minimum _id in the collection while segment boundaries are being computed might lead to duplicate row errors or data loss. Make sure that the lowest _id in the collection remains constant.
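For a rough sense of the boundary computation overhead, the sketch below counts the paginated skip steps needed to locate every lower boundary. It assumes, as in the SALES example, that each segment’s offset is measured from the start of the collection and that one step skips at most max-records-skip-per-page documents. The helper skip_query_count is hypothetical, and the real cost also depends on factors such as source latency.

```python
def skip_query_count(total_documents, number_of_partitions, max_skip):
    """Estimate how many paginated skip steps are needed for all boundaries."""
    per_segment = total_documents // number_of_partitions
    steps = 0
    for segment in range(number_of_partitions):
        offset = segment * per_segment              # documents to skip for this segment
        steps += (offset + max_skip - 1) // max_skip  # ceiling division
    return steps


# The 15-document SALES example: skips of 0, 5, and 10 each fit in one page of 100.
print(skip_query_count(15, 3, 100))

# A hypothetical 100-million-document collection with the default settings.
print(skip_query_count(100_000_000, 16, 10_000))
```

Under these assumptions, the estimate grows as the skip value shrinks, which illustrates why a higher max-records-skip-per-page is generally recommended unless source latency or cursor timeouts become a problem.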

For a full list of limitations, see Segmenting MongoDB collections and migrating in parallel and Segmenting Amazon DocumentDB collections and migrating in parallel.

Using Range Segmentation with AWS DMS

With range segmentation, you can specify the number of segments and their boundaries within the table mapping rule of the replication task. Compared to auto segmentation, range segmentation gives you more flexibility in creating segments. If the document size doesn’t follow a uniform distribution in the collection or if the dataset is skewed, range segmentation enables you to give each segment a disproportionate share of the load. In addition, range segmentation supports partitions based on multiple fields, which helps when the combined values of two fields follow a pattern that is more straightforward to segment. For MongoDB and Amazon DocumentDB source endpoints, the following field types are supported as the partition fields:

  • ObjectId
  • String
  • Integer (INT32 and INT64)
  • Double

The following JSON block is an example parallel load rule in the table mappings when creating a replication task to enable range segmentation:

"parallel-load": {
    "type": "ranges",
    "columns": [
        "_id",
        "REGION"
    ],
    "boundaries": [
        [
            "610d86b7cbdda740c67de32a",
            "NORTH"
        ],
        [
            "610d86b7cbdda740c67de330",
            "EAST"
        ]
    ]
}

The table mapping has the following parameters:

  • type – Valid for MongoDB and Amazon DocumentDB source endpoints. Setting type to ranges is a required parameter to enable range segmentation.
  • columns – The fields on which the filter query is constructed. This is a required parameter.
  • boundaries – The range boundaries for each segment. This is a required parameter.

For the same HR.SALES collection example we used earlier, the following JSON block is an example table mapping to migrate data using multi-field range segmentation:

{
    "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "HR",
                "table-name": "SALES"
            },
            "rule-action": "include"
        },
        {
            "rule-type": "table-settings",
            "rule-id": "2",
            "rule-name": "2",
            "object-locator": {
                "schema-name": "HR",
                "table-name": "SALES"
            },
            "parallel-load": {
                "type": "ranges",
                "columns": [
                    "_id",
                    "REGION"
                ],
                "boundaries": [
                    [
                        "610d86b7cbdda740c67de32a",
                        "NORTH"
                    ],
                    [
                        "610d86b7cbdda740c67de330",
                        "EAST"
                    ]
                ]
            }
        }
    ]
}

The preceding configuration migrates the example data in three segments (three threads in parallel):

  • The first segment migrates all documents with _id <= 610d86b7cbdda740c67de32a and REGION <= NORTH
  • The second segment migrates all documents with _id <= 610d86b7cbdda740c67de330 and REGION <= EAST and excludes documents with _id <= 610d86b7cbdda740c67de32a and REGION <= NORTH
  • The third segment migrates all documents excluding documents with _id <= 610d86b7cbdda740c67de330 and REGION <= EAST
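The following Python sketch shows one way to read the three bullets above: a document goes to the first segment whose boundary it satisfies on every field, and to the last segment if it satisfies none. This follows the wording of the bullets, not the actual filter query that AWS DMS generates, which isn’t shown in this post. The function segment_for and the sample documents are hypothetical.

```python
COLUMNS = ["_id", "REGION"]
BOUNDARIES = [
    ["610d86b7cbdda740c67de32a", "NORTH"],
    ["610d86b7cbdda740c67de330", "EAST"],
]


def segment_for(document, columns=COLUMNS, boundaries=BOUNDARIES):
    """Return the zero-based segment index for a document (hypothetical helper)."""
    for index, boundary in enumerate(boundaries):
        # A boundary is satisfied when every field is <= its boundary value.
        if all(document[column] <= limit for column, limit in zip(columns, boundary)):
            return index
    # Documents beyond every boundary fall into the final segment.
    return len(boundaries)


# Hypothetical sample documents; _id values are compared as hex strings.
samples = [
    {"_id": "610d86b7cbdda740c67de328", "REGION": "EAST"},
    {"_id": "610d86b7cbdda740c67de32d", "REGION": "EAST"},
    {"_id": "610d86b7cbdda740c67de32d", "REGION": "NORTH"},
    {"_id": "610d86b7cbdda740c67de333", "REGION": "WEST"},
]
assignments = [segment_for(doc) for doc in samples]
print(assignments)
```

Running a check like this against a sample of your own documents before you start the task can show whether the boundaries you chose spread the load the way you intended.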

After you specify the table mappings, create and start a migration task to migrate the data to the target. For more information, see Creating a task.

Limitations of range segmentation

The main limitation of range segmentation is that you have to analyze the data and construct the boundaries yourself when you specify the table mapping rules. This can be tedious and prone to human error if you need to segment a large number of collections or specify a large number of segments per collection.

For a full list of limitations, see Segmenting MongoDB collections and migrating in parallel and Segmenting Amazon DocumentDB collections and migrating in parallel.

Summary

If you want to improve the performance of full load AWS DMS migrations from MongoDB or Amazon DocumentDB source endpoints without having to worry about boundaries for each segment, you can use auto segmentation for simplicity. If you have a skewed dataset, or want to specify boundaries for segments, you should use the range segmentation option.

In this post, we showed you how to use auto segmentation and range segmentation, and described the use cases and limitations for both. These features are designed to improve AWS DMS performance for a faster migration, and make it easier for you to migrate data to the target database.

For more information about this feature and our service, see the AWS DMS documentation. We also recommend reviewing Segmenting MongoDB collections and migrating in parallel and Segmenting Amazon DocumentDB collections and migrating in parallel.


About the Author

Sanketh Balakrishna is a Software Development Engineer with the Database Migration Service team at Amazon Web Services. He is passionate about building large-scale distributed systems that help customers migrate their data across databases. He is the lead developer for the MongoDB and Amazon DocumentDB auto segmentation features.

Suwen Ge is a Software Development Engineer with the Database Migration Service team at Amazon Web Services. He works on the design and development of customer-requested features, primarily for MongoDB and Amazon DocumentDB migrations using AWS DMS. He is the lead developer for the range segmentation feature for MongoDB migration.