How can I use AWS DMS to migrate data to Amazon S3 in Parquet format?

Last updated: 2021-01-18

How can I use AWS Database Migration Service (AWS DMS) to migrate data in Apache Parquet (.parquet) format to Amazon Simple Storage Service (Amazon S3)?

Resolution

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you’re using the most recent AWS CLI version.

You can use AWS DMS to migrate data to an S3 bucket in Apache Parquet format if you use replication engine version 3.1.3 or later. The default Parquet version is Parquet 1.0.

1.    Create a target Amazon S3 endpoint from the AWS DMS console, and then add an extra connection attribute (ECA), as follows. Also, review the other extra connection attributes that you can use for storing Parquet objects in an S3 target.

dataFormat=parquet;

Or, create a target Amazon S3 endpoint using the create-endpoint command in the AWS Command Line Interface (AWS CLI):

aws dms create-endpoint --endpoint-identifier s3-target-parquet --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "<IAM role ARN for S3 endpoint>", "BucketName": "<S3 bucket name to migrate to>", "DataFormat": "parquet"}'

2.    Use the following extra connection attribute to specify the Parquet version of the output file:

parquetVersion=PARQUET_2_0;
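If you create the endpoint from the AWS CLI instead of the console, both attributes can be passed together through the --extra-connection-attributes option of create-endpoint. A sketch, reusing the placeholder role ARN and bucket name from the earlier command (replace the placeholders with your own values before running it):

```shell
# Sketch: create the S3 target endpoint with both extra connection attributes.
# The role ARN and bucket name placeholders are assumptions; substitute your own.
aws dms create-endpoint \
  --endpoint-identifier s3-target-parquet \
  --engine-name s3 \
  --endpoint-type target \
  --s3-settings '{"ServiceAccessRoleArn": "<IAM role ARN for S3 endpoint>", "BucketName": "<S3 bucket name to migrate to>"}' \
  --extra-connection-attributes "dataFormat=parquet;parquetVersion=PARQUET_2_0;"
```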

3.    Run the describe-endpoints command to see if the S3 endpoint that you created has the S3 setting DataFormat or the extra connection attribute dataFormat set to "parquet". To check the S3 setting DataFormat, run a command similar to the following:

aws dms describe-endpoints --filters Name=endpoint-arn,Values=<S3 target endpoint ARN> --query "Endpoints[].S3Settings.DataFormat"
[
    "parquet"
]
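To check the extra connection attribute rather than the S3 setting, you can query the ExtraConnectionAttributes field of the same describe-endpoints output. A sketch, assuming the same endpoint ARN placeholder:

```shell
# Sketch: print the extra connection attributes set on the S3 target endpoint.
# The endpoint ARN placeholder is an assumption; substitute your own.
aws dms describe-endpoints \
  --filters Name=endpoint-arn,Values="<S3 target endpoint ARN>" \
  --query "Endpoints[].ExtraConnectionAttributes"
```

The output should include dataFormat=parquet if the attribute was applied.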

4.    If the value of the DataFormat parameter is CSV, then recreate the endpoint.
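Recreating the endpoint from the AWS CLI means deleting the CSV-format endpoint and creating a new one with DataFormat set to parquet. A sketch, assuming the same placeholder ARN, role ARN, and bucket name used above:

```shell
# Sketch: delete the CSV-format endpoint, then recreate it in Parquet format.
# All placeholder values are assumptions; substitute your own.
aws dms delete-endpoint --endpoint-arn "<S3 target endpoint ARN>"

aws dms create-endpoint \
  --endpoint-identifier s3-target-parquet \
  --engine-name s3 \
  --endpoint-type target \
  --s3-settings '{"ServiceAccessRoleArn": "<IAM role ARN for S3 endpoint>", "BucketName": "<S3 bucket name to migrate to>", "DataFormat": "parquet"}'
```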

5.    After you have the output in Parquet format, you can parse the output file by installing the parquet-cli command line tool:

pip install parquet-cli --user
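The parq tool reads local files, so copy a migrated object down from the S3 target bucket before inspecting it. A sketch, assuming a placeholder bucket name and the hypothetical schema/table prefix under which AWS DMS writes the LOAD00000001.parquet object:

```shell
# Sketch: download one migrated Parquet object for local inspection.
# The bucket name and object prefix are assumptions; adjust to your target bucket.
aws s3 cp "s3://<S3 bucket name to migrate to>/<schema>/<table>/LOAD00000001.parquet" .
```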

6.    Then, inspect the file format:

parq LOAD00000001.parquet 
 # Metadata 
 <pyarrow._parquet.FileMetaData object at 0x10e948aa0>
  created_by: AWS
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 169

7.    Finally, print the file content:

parq LOAD00000001.parquet --head
   i        c
0  1  insert1
1  2  insert2
