My Apache Hive external metastore is hosted on an Amazon Relational Database Service (Amazon RDS) MySQL instance. I upgraded to a newer release version of Amazon EMR, and now my cluster doesn't launch. I get an error message similar to the following:

Error: Duplicate key name 'PCS_STATS_IDX' (state=42000,code=1061)

Hive stores the schema version in the metastore database. The Amazon EMR cluster won't launch because the provision-node script fails while trying to initialize the Hive schema using the older metastore schema version. To resolve this, upgrade the Hive metastore schema to the latest version. 

Note: To upgrade from Hive 0.x to Hive 2.x, you must complete the following procedure twice: once to upgrade from Hive 0.x to 1.x, and then again to upgrade from Hive 1.x to 2.x. It's not possible to upgrade directly from Hive 0.x to 2.x.

1.    Create a snapshot to back up the Amazon RDS instance that hosts your Hive metastore.

2.     Launch an Amazon EMR cluster with one master node and no core or task nodes. Choose the release version that you want to upgrade to. Don't configure the external metastore for Hive.

3.    Connect to the master node using SSH.

4.    Open the /etc/hive/conf/hive-site.xml file on the master node as root using sudo, and then change the following configuration keys and values:

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://<HOSTNAME OF YOUR EXTERNAL METASTORE DB>:3306/hive?createDatabaseIfNotExist=true</value>
</property>

<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value><USERNAME OF YOUR AMAZON RDS DB></value>
</property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value><PASSWORD OF YOUR AMAZON RDS DB></value>
</property>

5.    Save the changes.

6.    Run the following command from the Hive command line interface (CLI) to upgrade the schema version: 

hive --service schemaTool -dbType mysql -upgradeSchema

7.    Execute the following command from the Hive CLI to verify the upgrade:  

hive --service schemaTool -dbType mysql -info

8.    Launch a new Amazon EMR cluster, and then choose the same release version that you chose in step 2.

Note: After you launch the new cluster, you can terminate the cluster that you created in step 2.

9.    Configure the external Hive metastore to point to your Amazon RDS instance.

10.    Start the Hive CLI, and then execute the show tables command to verify that your tables are visible.


Did this page help you? Yes | No

Back to the AWS Support Knowledge Center

Need help? Visit the AWS Support Center.

Published: 2018-10-22