How can I control the number of CDC files being generated for my target S3 endpoint using AWS DMS?
Last updated: 2022-07-29
I want to control the number of change data capture (CDC) files that are being generated when I use Amazon Simple Storage Service (Amazon S3) as a target endpoint. How can I use AWS Database Migration Service (AWS DMS) to do this?
When you use Amazon S3 as a target endpoint for a full-load-and-CDC task or a CDC-only AWS DMS task, you can use several parameters to control the size of the files written to the target S3 endpoint.
This article discusses the following extra connection attributes (ECAs) and how you can use them to control the volume of CDC files generated on your Amazon S3 endpoint:
- cdcMaxBatchInterval - The maximum interval length condition, defined in seconds, to output a file to Amazon S3. The default value is 60 seconds.
- cdcMinFileSize - The minimum file size condition, defined in KB, to output a file to Amazon S3. The default value is 32000 KB.
- maxFileSize - The maximum size, in KB, of any .csv file to be created while migrating to an S3 target during full load. The default value is 1,048,576 KB (1 GB).
- WriteBufferSize - The size, in KB, of the in-memory file write buffer used when generating .csv files on the local disk at the AWS DMS replication instance. The default value is 1000 KB.
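ECAs are supplied on the endpoint as a semicolon-separated list of name=value pairs. As a rough illustration, the following Python sketch assembles such a string from the attributes listed above. The helper name build_eca_string and the values shown are hypothetical and chosen only for the example; they are not recommendations.

```python
def build_eca_string(settings):
    """Join settings into a semicolon-separated name=value list.

    Hypothetical helper for illustration; it is not part of any AWS SDK.
    """
    return ";".join(f"{name}={value}" for name, value in settings.items())


# Example values only: write a file after 120 seconds or once 64000 KB
# of changes have accumulated, whichever comes first.
eca = build_eca_string({
    "cdcMaxBatchInterval": 120,
    "cdcMinFileSize": 64000,
})
print(eca)  # cdcMaxBatchInterval=120;cdcMinFileSize=64000
```

Raising either value generally produces fewer, larger CDC files; lowering them produces more, smaller files.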
The cdcMaxBatchInterval parameter controls the time interval for writing files to Amazon S3. With the default value of 60 seconds, AWS DMS writes files to Amazon S3 every minute. Another important parameter is cdcMinFileSize, which sets the minimum file size condition for writing a CDC file. With the default value of 32000 KB, AWS DMS writes to Amazon S3 each time it has accumulated 32000 KB of change data.
The cdcMaxBatchInterval and cdcMinFileSize parameters work together: AWS DMS writes a file as soon as either condition is met. So, with the default setup, AWS DMS writes a file to Amazon S3 when it has either a minute of pending changes or 32000 KB of data, whichever happens first.
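The "whichever happens first" rule can be sketched as a simple check. This is a simplified model rather than AWS DMS code: the function name should_write_file is hypothetical, and the defaults are the ones given in the attribute list (60 seconds and 32000 KB).

```python
def should_write_file(elapsed_seconds, pending_kb,
                      cdc_max_batch_interval=60, cdc_min_file_size=32000):
    """Return True when either output condition is met (simplified model)."""
    interval_reached = elapsed_seconds >= cdc_max_batch_interval
    size_reached = pending_kb >= cdc_min_file_size
    return interval_reached or size_reached


print(should_write_file(10, 40000))  # True: size condition met first
print(should_write_file(60, 500))    # True: interval condition met
print(should_write_file(10, 500))    # False: neither condition met yet
```

On a low-traffic source, the interval condition usually triggers first, so increasing cdcMaxBatchInterval is the more direct way to reduce the number of small files.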
maxFileSize determines the maximum size of S3 target output files for both .csv and .parquet formats. However, when writing .parquet files, AWS DMS writes data in batches:
1. AWS DMS allocates a memory segment of 1024 KB, which is the default size for writeBufferSize.
2. Regardless of the value of maxFileSize, AWS DMS allocates at least one write buffer with a default size of 1 MB.
3. When AWS DMS finishes writing the first batch of data, it compares the current size of data against the maxFileSize. The data is written to a .parquet file in the target S3 bucket if the current size is greater than or equal to maxFileSize.
4. If you set maxFileSize to 1 MB, a single write buffer at the default writeBufferSize of 1 MB already meets that value, so the condition is satisfied as soon as the first buffer has been written. If you set writeBufferSize to less than 1 MB, the check happens when less than 1 MB of data has been written. Combined with a maxFileSize below 1 MB, this lets you decrease the size of the generated .parquet file.
Note: The WriteBufferSize parameter settings apply only to .parquet and not to .csv files.
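The batching steps above can be approximated with a small model. This is only a sketch under simplifying assumptions: every batch fills exactly one write buffer, compression is ignored, and the default buffer is taken as 1 MB (1024 KB) as in the steps. The function name estimated_parquet_file_kb is hypothetical.

```python
def estimated_parquet_file_kb(max_file_size_kb, write_buffer_size_kb=1024):
    """Estimate the size at which a .parquet file is written (simplified model).

    Data accumulates one write buffer at a time. After each batch, the
    accumulated size is compared against maxFileSize.
    """
    if max_file_size_kb <= 0 or write_buffer_size_kb <= 0:
        raise ValueError("sizes must be positive")
    written_kb = 0
    while True:
        written_kb += write_buffer_size_kb   # one batch fills one buffer
        if written_kb >= max_file_size_kb:   # check happens after each batch
            return written_kb


print(estimated_parquet_file_kb(1024))     # 1024
print(estimated_parquet_file_kb(100))      # 1024: cannot go below one buffer
print(estimated_parquet_file_kb(100, 64))  # 128
```

In this model, lowering maxFileSize alone has no effect below the buffer size, which is why the write buffer must also be reduced to produce smaller .parquet files.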