AWS Database Blog
Announcing the support of Parquet data format in AWS DMS 3.1.3
Today AWS DMS announces support for migrating data to Amazon S3 from any AWS-supported source in Apache Parquet data format. This is one of the many new features in DMS 3.1.3. Many of you use the “S3 as a target” support in DMS to build data lakes. Then, you use this data with other AWS services like Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. You’ve requested a way to support migrations to S3 in different formats—and now that feature is available.
This post shows how you can configure a DMS task to migrate data in Parquet format to an S3 bucket and folder of your choice.
Overview
Apache Parquet is built to support efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows you to specify compression schemes on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. Now with AWS DMS 3.1.3, you can support migrations to S3 in the Parquet format.
Walkthrough
First, create an S3 target endpoint with the appropriate settings. There are two ways to do this with the required extra connection attributes for migrating the data in Parquet format: through the AWS CLI or the DMS console.
Create an S3 target endpoint using the AWS CLI
Modify the existing S3 endpoint to add an extra connection attribute that sets the data format to Parquet, using the following command:
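Here is a minimal sketch of that command; the endpoint ARN is a placeholder that you replace with the ARN of your existing S3 target endpoint.

```bash
# Placeholder ARN: replace with the ARN of your own S3 target endpoint
aws dms modify-endpoint \
    --endpoint-arn "arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLEENDPOINT" \
    --extra-connection-attributes "dataFormat=parquet;"
```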
The response looks like the following:
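The response shown here is abbreviated and illustrative only; the identifier and ARN are placeholders. The key field is ExtraConnectionAttributes, which now includes the Parquet setting.

```json
{
    "Endpoint": {
        "EndpointIdentifier": "my-s3-target",
        "EndpointType": "TARGET",
        "EngineName": "s3",
        "ExtraConnectionAttributes": "dataFormat=parquet;",
        "EndpointArn": "arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLEENDPOINT"
    }
}
```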
Make sure that you use the latest version of the AWS CLI. Older versions return an error when you run this command.
Create an S3 target endpoint using the DMS console
Use the following steps to create the S3 target endpoint:
- In the DMS console, under Resource management, choose Endpoints, Create endpoint.
- Add the extra connection attribute as follows:
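The value is the same extra connection attribute used in the CLI example:

```
dataFormat=parquet;
```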
The endpoint configuration screen should resemble the one in the following screenshot.
For this example, Microsoft SQL Server is the source. Create a table on this source with the following command:
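The following is a minimal sketch of the table definition; the table name and column types are assumptions chosen to match the sample records shown below.

```sql
-- Hypothetical table name and column types, chosen to match the sample records below
CREATE TABLE dbo.persons (
    personid  INT PRIMARY KEY,
    lastname  VARCHAR(50),
    firstname VARCHAR(50),
    address   VARCHAR(100),
    city      VARCHAR(50)
);
```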
After you insert a few records into the table, the result set in SQL Server looks like the following:
| personid | lastname | firstname | address | city |
| --- | --- | --- | --- | --- |
| 1 | jordan | michael | 12345 street ave | chicago |
| 2 | potter | harry | 4567 terrace | seattle |
| 3 | smith | will | 8901 3rd ave | los angeles |
| 4 | staton | jason | 7896 street | las vegas |
Create a task with the previous target endpoint. After the task migration is complete, a Parquet file is created on an S3 bucket, as shown in the following screenshot.
You can copy the Parquet file into Amazon Redshift or query the file using Athena or AWS Glue. Learn how in the following sections.
Copy the Parquet file using Amazon Redshift
To copy the Parquet file into Amazon Redshift, connect to the Amazon Redshift cluster and create the table using the same syntax that you used for the SQL Server source, as follows:
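A sketch of the equivalent table, reusing the hypothetical column definitions from the SQL Server example:

```sql
-- Same hypothetical columns as the SQL Server source table
CREATE TABLE public.persons (
    personid  INT,
    lastname  VARCHAR(50),
    firstname VARCHAR(50),
    address   VARCHAR(100),
    city      VARCHAR(50)
);
```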
The table is now ready on Amazon Redshift. Run the copy command against the Amazon Redshift cluster using the following command:
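A sketch of the COPY command; the bucket path and IAM role ARN are placeholders for your own values.

```sql
-- Placeholder S3 path and IAM role; FORMAT AS PARQUET tells COPY to read Parquet input
COPY public.persons
FROM 's3://your-bucket/your-folder/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftCopyRole'
FORMAT AS PARQUET;
```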
Query the file using Athena
Now use Athena to retrieve the result set from the Parquet file. From the Athena console, create a table and provide the location of the input dataset using the following commands. In this example, the input dataset is the S3 folder that the DMS task wrote to.
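A sketch of the Athena DDL; the table name and S3 location are placeholders, and the columns mirror the hypothetical definition used earlier.

```sql
-- Placeholder table name and LOCATION; STORED AS PARQUET matches the format DMS wrote to S3
CREATE EXTERNAL TABLE persons_parquet (
    personid  INT,
    lastname  STRING,
    firstname STRING,
    address   STRING,
    city      STRING
)
STORED AS PARQUET
LOCATION 's3://your-bucket/your-folder/';
```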
After you provide the required details and create the table, you can run a query that returns the following result set.
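For example, a simple query against the hypothetical table defined above:

```sql
SELECT * FROM persons_parquet ORDER BY personid;
```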
| personid | lastname | firstname | address | city |
| --- | --- | --- | --- | --- |
| 1 | jordan | michael | 12345 street ave | chicago |
| 2 | potter | harry | 4567 terrace | seattle |
| 3 | smith | will | 8901 3rd ave | los angeles |
| 4 | staton | jason | 7896 street | las vegas |
Conclusion
Using DMS 3.1.3, you can now configure a DMS task to migrate data in Parquet format to an S3 bucket. You can create the S3 target endpoint in two ways: through the AWS CLI or through the AWS DMS console.
If you have questions or suggestions, please add your comments below. Good luck and happy migrating!
About the Author
Harish Bannai is a cloud support engineer at AWS. He works with customers to provide guidance and technical assistance on database projects, helping them to improve the value of their solutions when using AWS.