How can I resolve the error "Unknown dataset URI pattern: dataset" when exporting Amazon RDS data to Amazon S3 in Parquet format using Sqoop?

Last updated: 2020-05-11

I'm trying to use an Amazon EMR cluster to export Amazon Relational Database Service (Amazon RDS) data to Amazon Simple Storage Service (Amazon S3) in Apache Parquet format using Apache Sqoop. I'm using the --as-parquetfile parameter, but I keep getting this error:

"Check that JARs for s3a datasets are on the class path org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset."

Short Description

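This error occurs in Sqoop version 1.4.7 when the Kite SDK JAR for Amazon S3 datasets isn't on the Sqoop class path. To resolve the error, download kite-data-s3-1.1.0.jar and install it in the Sqoop library directory.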

Resolution

Note: The following solution was tested on Amazon EMR release version 5.28.0 and Sqoop version 1.4.7.

1.    Connect to the master node using SSH.

2.    Use wget to download the kite-data-s3-1.1.0.jar:

[hadoop@ip-xxx-xx-xx-x]$ wget https://repo1.maven.org/maven2/org/kitesdk/kite-data-s3/1.1.0/kite-data-s3-1.1.0.jar

3.    Confirm that the downloaded file is the correct size (1.7 MB):

[hadoop@ip-xxx-xx-xx-x]$ du -h kite-data-s3-1.1.0.jar
1.7M     kite-data-s3-1.1.0.jar

4.    Copy the JAR to the Sqoop library directory (/usr/lib/sqoop/lib/):

sudo cp kite-data-s3-1.1.0.jar /usr/lib/sqoop/lib/

5.    Set the permissions on the JAR in the Sqoop library directory to 755:

sudo chmod 755 /usr/lib/sqoop/lib/kite-data-s3-1.1.0.jar

6.    Run the Sqoop import with the s3n connector (an s3n:// target directory). If you use the s3 connector (s3://), you get the Unknown dataset URI pattern: dataset error.

sqoop import --connect jdbc:mysql://mysql.cdfqbesrukqe.eu-west-1.rds.amazonaws.com:8193/dev --username admin -P --table hist_root --target-dir "s3n://awsexamplebucket/sqoop_parquet/demo" --as-parquetfile -m 2 --split-by identifiers -- --schema onwatch
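
Because step 6 depends on the target directory using the s3n scheme, a small guard can catch an s3:// URI before Sqoop runs. The following is a minimal sketch, not part of Sqoop: the function name to_s3n_uri is hypothetical, and it assumes that swapping the scheme prefix is the only change the URI needs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Sqoop): rewrite an s3:// URI to s3n://.
# URIs that use any other scheme are printed unchanged.
to_s3n_uri() {
  case "$1" in
    s3://*) printf 's3n://%s\n' "${1#s3://}" ;;
    *)      printf '%s\n' "$1" ;;
  esac
}
```

With this helper defined, the import command could pass --target-dir "$(to_s3n_uri "s3://awsexamplebucket/sqoop_parquet/demo")" and still receive the s3n:// form shown in step 6.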

For more information about Kite SDK dataset URIs, see Dataset, View, and Repository URIs in the Kite SDK documentation.
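
Steps 3 through 5 (size check, copy, permissions) can also be combined into one guarded helper. The following is a minimal sketch rather than an AWS-provided script: the function name install_jar and the 1 MiB default minimum size are assumptions chosen for illustration, and it uses the standard install utility in place of separate cp and chmod commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a JAR into a library directory with mode 755,
# after checking that the download is not missing or truncated.
# Arguments: $1 = path to the JAR
#            $2 = destination directory
#            $3 = minimum acceptable size in bytes (assumed default: 1 MiB)
install_jar() {
  local jar dest min_bytes size
  jar="$1"
  dest="$2"
  min_bytes="${3:-1048576}"

  if [ ! -f "$jar" ]; then
    echo "missing file: $jar" >&2
    return 1
  fi

  # Arithmetic expansion strips any padding that wc adds on some platforms.
  size=$(( $(wc -c < "$jar") ))
  if [ "$size" -lt "$min_bytes" ]; then
    echo "file too small ($size bytes): $jar" >&2
    return 1
  fi

  # install copies the file and sets its permissions in one step.
  install -m 755 "$jar" "$dest/"
}
```

On the cluster, the call would be install_jar kite-data-s3-1.1.0.jar /usr/lib/sqoop/lib. The commands in steps 4 and 5 use sudo, so the function would need to run from a script started with sudo to write to that directory.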

