AWS Database Blog

Announcing the support of Parquet data format in AWS DMS 3.1.3

Today, AWS DMS announces support for migrating data to Amazon S3 from any AWS DMS-supported source in the Apache Parquet data format. This is one of the many new features in DMS 3.1.3. Many of you use the “S3 as a target” support in DMS to build data lakes, and then use that data with other AWS services such as Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. You’ve asked for a way to migrate to S3 in different formats, and that feature is now available.

This post shows how you can configure a DMS task to migrate data in Parquet format to an S3 bucket and folder of your choice.

Overview

Apache Parquet is built to support efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet lets you specify compression schemes on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. With AWS DMS 3.1.3, you can now migrate data to S3 in the Parquet format.

Walkthrough

First, create an S3 target endpoint with the appropriate settings. There are two ways to configure the extra connection attributes required for migrating data in Parquet format: through the AWS CLI or through the DMS console.

Create an S3 target endpoint using the AWS CLI

Modify the existing S3 endpoint to set the data format to Parquet by running the following command:

aws dms modify-endpoint --endpoint-arn arn:aws:dms:us-west-2:249813873556:endpoint:NRK76ZBYH2RNLNMALZHSNJICE4 --s3-settings '{"ServiceAccessRoleArn": "arn:aws:iam::249813873556:role/S3-Target-DMS", "BucketName": "test-dms-s3-target", "DataFormat": "parquet"}'

The output looks like the following:

{
    "Endpoint": {
        "Status": "active", 
        "S3Settings": {
            "CompressionType": "NONE", 
            "EnableStatistics": true, 
            "BucketFolder": "", 
            "CsvRowDelimiter": "\\n", 
            "ParquetVersion": "parquet-1-0", 
            "CsvDelimiter": ",", 
            "ServiceAccessRoleArn": "arn:aws:iam::249813873556:role/S3-Target-DMS", 
            "DataFormat": "parquet", 
            "BucketName": "test-dms-s3-target", 
            "EncodingType": "rle-dictionary"
        }, 
        "EndpointType": "TARGET", 
        "ServiceAccessRoleArn": "arn:aws:iam::249813873556:role/S3-Target-DMS", 
        "SslMode": "none", 
        "EndpointArn": "arn:aws:dms:us-west-2:249813873556:endpoint:NRK76ZBYH2RNLNMALZHSNJICE4", 
        "ExtraConnectionAttributes": "bucketName=test-dms-s3-target;compressionType=NONE;csvDelimiter=,;csvRowDelimiter=\\n;", 
        "EngineDisplayName": "Amazon S3", 
        "EngineName": "s3", 
        "EndpointIdentifier": "target-s3"
    }
}

Make sure to use the latest version of the AWS CLI. If you use an older version, you see the following error:

"Unknown parameter in S3Settings: "DataFormat", must be one of: ServiceAccessRoleArn, ExternalTableDefinition, CsvRowDelimiter, CsvDelimiter, BucketFolder, BucketName, CompressionType"

Create an S3 target endpoint using the DMS console

Use the following steps to create the S3 target endpoint:

  1. In the DMS console, under Resource management, choose Endpoints, Create endpoint.
  2. Add the extra connection attribute as follows:
compressionType=NONE;csvDelimiter=,;csvRowDelimiter=\n;dataFormat=parquet

The endpoint configuration screen should resemble the one in the following screenshot.

For this example, Microsoft SQL Server is the source. Create a table on this source with the following command:

create table person (
  PersonID int NOT NULL PRIMARY KEY,
  LastName varchar(255),
  FirstName varchar(255),
  Address varchar(255),
  City varchar(255) 
);
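Next, insert a few sample rows. The following statement is a minimal sketch that uses the values shown in the result set below:

insert into person (PersonID, LastName, FirstName, Address, City) values
  (1, 'jordan', 'michael', '12345 street ave', 'chicago'),
  (2, 'potter', 'harry', '4567 terrace', 'seattle'),
  (3, 'smith', 'will', '8901 3rd ave', 'los angeles'),
  (4, 'staton', 'jason', '7896 street', 'las vegas');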

After you insert these records, the table in SQL Server looks like the following:

personid  lastname  firstname  address           city
1         jordan    michael    12345 street ave  chicago
2         potter    harry      4567 terrace      seattle
3         smith     will       8901 3rd ave      los angeles
4         staton    jason      7896 street       las vegas

Create a task with the previous target endpoint. After the migration task completes, a Parquet file is created in the S3 bucket, as shown in the following screenshot.
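You can create the task in the DMS console or with the AWS CLI. The following create-replication-task call is only a sketch; the task identifier, the source endpoint ARN, the replication instance ARN, and the table mappings shown here are placeholder assumptions for this example:

aws dms create-replication-task \
    --replication-task-identifier sqlserver-to-s3-parquet \
    --source-endpoint-arn <source-endpoint-arn> \
    --target-endpoint-arn arn:aws:dms:us-west-2:249813873556:endpoint:NRK76ZBYH2RNLNMALZHSNJICE4 \
    --replication-instance-arn <replication-instance-arn> \
    --migration-type full-load \
    --table-mappings '{"rules":[{"rule-type":"selection","rule-id":"1","rule-name":"1","object-locator":{"schema-name":"dbo","table-name":"person"},"rule-action":"include"}]}'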

You can copy the Parquet file into Amazon Redshift or query the file using Athena or AWS Glue. Learn how in the following sections.

Copy the Parquet file using Amazon Redshift

To copy the Parquet file into Amazon Redshift, connect to the Amazon Redshift cluster and create the table using the same column definitions as the SQL Server source, as follows:

create table person (
  PersonID int,
  LastName varchar(255),
  FirstName varchar(255),
  Address varchar(255),
  City varchar(255) 
);

The table is now ready on Amazon Redshift. Run the COPY command against the Amazon Redshift cluster as follows:

COPY person
FROM 's3://bucketname/foldername/'
IAM_ROLE 'arn:aws:iam::<accountnumber>:role/<role_name>'
FORMAT AS PARQUET;
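To confirm that the load succeeded, you can run a simple query against the table, for example:

select * from person order by personid;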

Query the file using Athena

Now use Athena to retrieve the result set from the Parquet file. From the Athena console, create a table and provide the location of the input dataset using the following command. In this example, the input location is the S3 bucket and folder that DMS wrote the Parquet file to.

CREATE EXTERNAL TABLE `s3parquettable`(
  `personid` int,
  `lastname` string,
  `firstname` string,
  `address` string,
  `city` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://<bucketname>/Foldername'
TBLPROPERTIES (
  'has_encrypted_data'='false')

After you provide the required details and create the table, you can run a query that returns the following result set.
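The exact query is up to you; a minimal example that returns all migrated rows is:

SELECT * FROM s3parquettable ORDER BY personid;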

personid  lastname  firstname  address           city
1         jordan    michael    12345 street ave  chicago
2         potter    harry      4567 terrace      seattle
3         smith     will       8901 3rd ave      los angeles
4         staton    jason      7896 street       las vegas

Conclusion

Using DMS 3.1.3, you can now configure a DMS task to migrate data in Parquet format to an S3 bucket. You can create an S3 target endpoint in two ways: through the AWS CLI or through the AWS DMS console.

If you have questions or suggestions, please add your comments below. Good luck and happy migrating!



About the Author


Harish Bannai is a cloud support engineer at AWS. He works with customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.