AWS Big Data Blog

Migrate External Table Definitions from a Hive Metastore to Amazon Athena

by Neil Mukerje | on | | Comments

For customers who use Hive external tables on Amazon EMR, or any flavor of Hadoop, a key challenge is how to effectively migrate an existing Hive metastore to Amazon Athena, an interactive query service that directly analyzes data stored in Amazon S3. With Athena, there are no clusters to manage and tune, and no infrastructure to set up or manage. Customers pay only for the queries they run.

In this post, I discuss an approach to migrate an existing Hive metastore to Athena, as well as how to use the Athena JDBC driver to run scripts. I demonstrate two scripts.

  1. The first script exports external tables from a Hive metastore on EMR, or other Hadoop flavors, as a Hive script. This script handles both Hive metastores local to the cluster or metastores stored in an external database.
  1. The second script executes the Hive script in Athena over JDBC to import the external tables into the Athena catalog.

Both scripts are available in the aws-blog-athena-importing-hive-metastores GitHub repo.

Prerequisites

You must have the following resources available:

  • A working Python 2.7+ environment. (required for the first script)
  • A working Java 1.8 runtime environment
  • Groovy, if not already installed
  • The Java classpath set to point to the Athena JDBC driver JAR file location

In EMR, you can use the following commands to complete the prerequisites (Python comes already installed):

# set Java to 1.8
EMR $> export JAVA_HOME=/usr/lib/jvm/java-1.8.0

# Download Groovy and set Groovy binary in PATH
EMR $> wget https://dl.bintray.com/groovy/maven/apache-groovy-binary-2.4.7.zip
EMR $> unzip apache-groovy-binary-2.4.7.zip
EMR $> export PATH=$PATH:`pwd`/groovy-2.4.7/bin/:

# Download latest Athena JDBC driver and set it in JAVA CLASSPATH
EMR $> aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.0.0.jar .
EMR $> export CLASSPATH=`pwd`/AthenaJDBC41-1.0.0.jar:;

Exporting external tables from a Hive metastore

The Python script exportdatabase.py exports external tables only from the Hive metastore, and saves them to a local file as a Hive script.

EMR $> python exportdatabase.py <<Hive database name>> 

Here’s the sample output:

EMR $> python exportdatabase.py default

Found 10 tables in database...

Database metadata exported to default_export.hql.

Athena does not support every data type and SerDe supported by Hive. Edit or replace contents in the generated Hive script as needed to ensure compatibility. For more information about supported datatypes and SerDes, see the Amazon Athena documentation.

Importing the external tables into Athena

The Groovy script executescript.gvy connects to Athena and executes the Hive script generated earlier. Make sure that you update the access key ID and secret access key, and replace ‘s3_staging_dir’ with an Amazon S3 location. Also, make sure that the target database specified in the command exists in Athena.

EMR $> groovy executescript.gvy <<target database in Athena>> <<Hive script file>>

Here’s the sample output:

$ groovy executescript.gvy playdb default_export.hql 

Found 2 statements in script...

1. Executing :DROP TABLE IF EXISTS `nyc_trips_pq`

result : OK

2. Executing :
CREATE EXTERNAL TABLE `nyc_trips_pq`(
  `vendor_name` string,
  `trip_pickup_datetime` string,
  `trip_dropoff_datetime` string,
  `passenger_count` int,
  `trip_distance` float,
  `payment_type` string,
  `are_amt` float,
  `surcharge` float,
  `mta_tax` float,
  `tip_amt` float,
  `tolls_amt` float,
  `total_amt` float)
PARTITIONED BY (
  `year` string,
  `month` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  LOCATION
  's3://bucket/dataset2'
TBLPROPERTIES (
  'parquet.compress'='SNAPPY',
  'transient_lastDdlTime'='1478199332')

result : OK

The executescript.gvy script can be used to execute any Hive script in Athena. For example, you can use this script to add partitions to an existing Athena table that uses a custom partition format.

You can save the following SQL statements to a script, named addpartitions.hql:

ALTER TABLE default.elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='01') location 's3://athena-examples/elb/raw/2015/01/01';
ALTER TABLE default.elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='02') location 's3://athena-examples/elb/raw/2015/01/02';
ALTER TABLE default.elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='03') location 's3://athena-examples/elb/raw/2015/01/03';
ALTER TABLE default.elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='04') location 's3://athena-examples/elb/raw/2015/01/04';

Run the following command to execute addpartitions.hql:

EMR $> groovy executescript.gvy default addpartitions.hql 

Found 4 statements in script...

1. Executing :ALTER TABLE default.elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='01') location 's3://athena-examples/elb/raw/2015/01/01'

result : OK

2. Executing :
ALTER TABLE default.elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='02') location 's3://athena-examples/elb/raw/2015/01/02'

result : OK

3. Executing :
ALTER TABLE default.elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='03') location 's3://athena-examples/elb/raw/2015/01/03'

result : OK

4. Executing :
ALTER TABLE default.elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='04') location 's3://athena-examples/elb/raw/2015/01/04'

result : OK

Two samples—sample_createtable.hql and sample_addpartitions.hql—are included in the aws-blog-athena-importing-hive-metastores GitHub repo so that you can test the executescript.gvy script. Run them using the following commands to create the table and add partitions to it in Athena:

$> groovy executescript.gvy default sample_createtable.hql
$> groovy executescript default sample_addpartitions.hql

Summary

In this post, I showed you how to migrate an existing Hive metastore to Athena as well as how to use the Athena JDBC driver to execute scripts. Here are some ways you can extend this script:

  • Automatically discover partitions and add partitions to migrated external tables in Athena.
  • Periodically keep a Hive metastore in sync with Athena by applying only changed DDL definitions.

I hope you find this post useful and that this helps accelerate your Athena migration efforts. Please do share any feedback or suggestions in the comments section below.


About the Author

Neil_Mukerje_100Neil Mukerje is a Solutions Architect with AWS. He guides customers on AWS architecture and best practices. In his spare time, he ponders on the cosmic significance of everything he does.

 

 


Related

Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

o_IoT_Minutes_11