How do I use a PostgreSQL database as the external metastore for Hive on Amazon EMR?

Last updated: 2019-10-10

How can I use an Amazon Relational Database Service (Amazon RDS) for PostgreSQL DB instance as my external metastore for Apache Hive on Amazon EMR?

Resolution

Before you begin, note the following:

  • This solution assumes that you already have an active PostgreSQL database.
  • If you are using Amazon EMR release version 5.7 or earlier, download the PostgreSQL JDBC driver. Then, add it to the Hive library path (/usr/lib/hive/lib). Amazon EMR release versions 5.8.0 and later come with the PostgreSQL JDBC driver in the Hive library path.

To configure a PostgreSQL DB instance as the external metastore for Hive:

1.    Create a PostgreSQL DB instance.

2.    Modify the DB instance security group to allow connections on port 5432 between your database and the ElasticMapReduce-master security group. For more information, see VPC Security Groups.

3.    Launch an Amazon EMR cluster without an external metastore (just the default MySQL database).

4.    Connect to the master node using SSH.

5.    Replace the Hive configuration with the following properties. Replace these values in the example:
mypostgresql.c6c8mwvfdgv0.us-west-2.rds.amazonaws.com: the endpoint of your DB instance
mypgdb: the name of your PostegreSQL database
database_username: the DB instance username
database_password: the DB instance password

[hadoop@ip-X-X-X-X ~]$ sudo vi /etc/hive/conf/hive-site.xml

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://mypostgresql.c6c8mwvfdgv0.us-west-2.rds.amazonaws.com:5432/mypgdb</value>
    <description>PostgreSQL JDBC driver connection URL</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
    <description>PostgreSQL metastore driver class name</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>database_username</value>
    <description>the username for the DB instance</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>database_password</value>
    <description>the password for the DB instance</description>
  </property>

6.    Run the following commands to create the PostgreSQL schema:

[hadoop@ip-X-X-X-X ~]$ cd /usr/lib/hive/bin/
[hadoop@ip-X-X-X-X bin]$ ./schematool -dbType postgres -initSchema  
Metastore connection URL:     jdbc:postgresql://mypostgresql.c6c8mwvfdgv0.us-west-2.rds.amazonaws.com:5432/mypgdb
Metastore Connection Driver :     org.postgresql.Driver
Metastore connection User:     test
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.postgres.sql
Initialization script completed
schemaTool completed

7.    Stop and start Hive services so that the updated settings take effect:

[hadoop@ip-X-X-X-X bin]$ sudo initctl list |grep -i hive
hive-server2 start/running, process 11818
hive-hcatalog-server start/running, process 12708
[hadoop@ip-X-X-X-X9 bin]$ sudo stop hive-server2
hive-server2 stop/waiting
[hadoop@ip-X-X-X-X bin]$ sudo stop hive-hcatalog-server
hive-hcatalog-server stop/waiting
[hadoop@ip-X-X-X-X bin]$ sudo start hive-server2
hive-server2 start/running, process 18798
[hadoop@ip-X-X-X-X bin]$ sudo start hive-hcatalog-server
hive-hcatalog-server start/running, process 19614

8.    Log in to the Hive shell and then create a Hive table. In the following example, the table is called "test_postgres."

[hadoop@ip-X-X-X-X bin]$ hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
hive> show databases;
OK
default
Time taken: 0.569 seconds, Fetched: 1 row(s)
hive> create table test_postgres(a int,b int);
OK
Time taken: 0.708 seconds

9.    Install PostgreSQL:

[hadoop@ip-X-X-X-X bin]$ sudo yum install postgresql

10.   Connect to the PostgreSQL DB instance using the command line. Replace these values in the example:
mypostgresql.c6c8mwvfdgv0.us-west-2.rds.amazonaws.com: the endpoint of your DB instance
mypgdb: the name of your PostegreSQL database
database_username: the DB instance username

[hadoop@ip-X-X-X-X bin]$ psql --host=mypostgresql.c6c8mwvfdgv0.us-west-2.rds.amazonaws.com --port=5432 --username=database_username --password --dbname=mypgdb

11.   When prompted, enter the password for the DB instance.

12.   Run the following command to confirm that you can access the Hive table that you created earlier:

[hadoop@ip-X-X-X-X bin]$ select * from "TBLS";

 TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID |   TBL_NAME    |   TBL_TYPE    | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED 
--------+-------------+-------+------------------+--------+-----------+-------+---------------+---------------+--------------------+--------------------+--------------------
      1 |  1555014961 |     1 |                0 | hadoop |         0 |     1 | test_postgres | MANAGED_TABLE |                    |                    | f
(1 row)

Your Amazon EMR cluster is now using the PostgreSQL database as the external metastore for Hive.