AWS Big Data Blog
Migrate External Table Definitions from a Hive Metastore to Amazon Athena
For customers who use Hive external tables on Amazon EMR, or on any other flavor of Hadoop, a key challenge is how to effectively migrate an existing Hive metastore to Amazon Athena, an interactive query service that directly analyzes data stored in Amazon S3. With Athena, there are no clusters to manage and tune and no infrastructure to set up; customers pay only for the queries they run.
In this post, I discuss an approach to migrate an existing Hive metastore to Athena, as well as how to use the Athena JDBC driver to run scripts. I demonstrate two scripts.
- The first script exports external tables from a Hive metastore on EMR, or on other Hadoop flavors, as a Hive script. It handles both metastores that are local to the cluster and metastores stored in an external database.
- The second script executes the Hive script in Athena over JDBC to import the external tables into the Athena catalog.
Both scripts are available in the aws-blog-athena-importing-hive-metastores GitHub repo.
Prerequisites
You must have the following resources available:
- A working Python 2.7+ environment (required for the first script)
- A working Java 1.8 runtime environment
- Groovy, if not already installed
- The Java classpath set to point to the Athena JDBC driver JAR file location
On EMR, you can use the following commands to complete the prerequisites (Python is already installed):
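The exact commands depend on your EMR release and the driver version you use; the following is a rough sketch, with the driver download URL and JAR path as placeholders you should take from the Athena documentation:

```bash
# Install Groovy (package availability varies by EMR release; you can also
# install it manually from https://groovy-lang.org/)
sudo yum install -y groovy

# Download the Athena JDBC driver JAR; replace <athena-jdbc-driver-url> with
# the current download link from the Athena documentation
wget <athena-jdbc-driver-url> -O /home/hadoop/AthenaJDBC.jar

# Point the Java classpath at the driver JAR so the Groovy script can load it
export CLASSPATH=$CLASSPATH:/home/hadoop/AthenaJDBC.jar
```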
Exporting external tables from a Hive metastore
The Python script exportdatabase.py exports only the external tables from the Hive metastore and saves them to a local file as a Hive script.
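The exact arguments are documented in the GitHub repo; a typical invocation looks something like the following, where the database name is illustrative:

```bash
# Export the external table definitions for a Hive database to a local Hive script
# (the database name is illustrative; see the repo's README for the exact usage)
python exportdatabase.py mydatabase
```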
Here’s the sample output:
Athena does not support every data type and SerDe that Hive supports. Edit or replace the contents of the generated Hive script as needed to ensure compatibility. For more information about supported data types and SerDes, see the Amazon Athena documentation.
Importing the external tables into Athena
The Groovy script executescript.gvy connects to Athena and executes the Hive script generated earlier. Make sure that you update the access key ID and secret access key, and replace the 's3_staging_dir' value with an Amazon S3 location where Athena can write query results. Also, make sure that the target database specified in the command already exists in Athena.
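For reference, the connection inside the script follows the standard JDBC pattern for the first-generation Athena driver. The snippet below is only a minimal sketch of that setup, not the script itself; the region, credentials, and bucket are placeholders, and the class name and URL differ if you use the newer Simba-based driver:

```groovy
import java.sql.DriverManager

// Load the first-generation Athena JDBC driver
Class.forName("com.amazonaws.athena.jdbc.AthenaDriver")

def props = new Properties()
props.setProperty("user", "<access_key_id>")           // AWS access key ID
props.setProperty("password", "<secret_access_key>")   // AWS secret access key
props.setProperty("s3_staging_dir", "s3://<your-bucket>/athena-query-results/")

def conn = DriverManager.getConnection(
    "jdbc:awsathena://athena.us-east-1.amazonaws.com:443", props)

// Each DDL statement from the Hive script can then be run through a Statement
def stmt = conn.createStatement()
stmt.execute("SHOW TABLES")
stmt.close()
conn.close()
```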
Here’s the sample output:
The executescript.gvy script can be used to execute any Hive script in Athena. For example, you can use this script to add partitions to an existing Athena table that uses a custom partition format.
You can save the following SQL statements to a script, named addpartitions.hql:
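For example, a script along these lines adds partitions whose Amazon S3 locations don't follow the default Hive key=value layout (the table name, partition columns, and bucket paths here are illustrative):

```sql
-- addpartitions.hql: add partitions whose S3 paths use a custom layout
-- (table name, partition columns, and S3 locations are illustrative)
ALTER TABLE mytable ADD IF NOT EXISTS
  PARTITION (year='2016', month='12') LOCATION 's3://mybucket/mydata/2016/12/'
  PARTITION (year='2017', month='01') LOCATION 's3://mybucket/mydata/2017/01/';
```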
Run the following command to execute addpartitions.hql:
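The argument order below is illustrative (check the script's usage in the repo); the target database must already exist in Athena:

```bash
# Run addpartitions.hql against the target Athena database
groovy executescript.gvy mydatabase addpartitions.hql
```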
Two samples—sample_createtable.hql and sample_addpartitions.hql—are included in the aws-blog-athena-importing-hive-metastores GitHub repo so that you can test the executescript.gvy script. Run them using the following commands to create the table and add partitions to it in Athena:
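Again assuming the same illustrative argument order, the two samples can be run like this:

```bash
# Create the sample table, then add partitions to it
groovy executescript.gvy mydatabase sample_createtable.hql
groovy executescript.gvy mydatabase sample_addpartitions.hql
```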
Summary
In this post, I showed you how to migrate an existing Hive metastore to Athena, as well as how to use the Athena JDBC driver to execute scripts. Here are some ways you can extend these scripts:
- Automatically discover partitions and add partitions to migrated external tables in Athena.
- Periodically keep a Hive metastore in sync with Athena by applying only changed DDL definitions.
I hope you find this post useful and that it helps accelerate your Athena migration efforts. Please share any feedback or suggestions in the comments section below.
About the Author
Neil Mukerje is a Solutions Architect with AWS. He guides customers on AWS architecture and best practices. In his spare time, he ponders on the cosmic significance of everything he does.